Chances and Challenges of Computational Data Gathering and Analysis

Digital and social media, together with large available data-sets, generate various new possibilities and challenges for conducting research on perpetually developing online news ecosystems. This paper presents a novel computational technique for gathering and processing large quantities of data from Facebook. We demonstrate how to use this technique for detecting and analysing issue-attention cycles and news flows in Facebook groups and pages. Although the paper concentrates on a Finnish Facebook group as a case study, the demonstrated method can be used for gathering and analysing large data-sets from various social network sites and national contexts. The paper also discusses Facebook platform regulations concerning data gathering and ethical issues in conducting online research.


Introduction
Alongside the digital revolution and development of virtual environments, social scientific research is experiencing a paradigm shift towards computational approaches (Chang, Kauffman, and Kwon 2014, 67). Traditional qualitative and quantitative methods of the social sciences appear to be limited in studying the phenomena of such rapidly altering environments as the internet or social media (Karpf 2012, 646). As social science research begins to use data-driven methods and novel tools for data gathering and analysis, a multidisciplinary approach is increasingly common (Chang, Kauffman, and Kwon 2014, 70). According to Steensen and Ahva (2014), we are living through the "fourth wave" of research about digital journalism, which, following the normative, empirical and constructivist waves, emerged mainly because of new practices related to social media. While arguing there is a need to reassess theories, they do not, surprisingly, identify any similar need for new methodologies.
Digital Journalism, 2016, Vol. 4, No. 1, 55-74.
The internet and social media entail various platforms and social network sites (SNS) whose potential has not yet been sufficiently harnessed or discussed for research, especially from the perspective of data gathering and analysis for social scientific or, specifically, journalism research. For example, Facebook generates a hybrid digital communication and news ecosystem where issues rise and fall in newsfeeds and on specific groups and pages, and news flows are continuously created by the users themselves, including the sharing of news generated by online news media.
Consequently, the internet and social media enable the generation of and access to massive data-sets ("big data") as sources of research material (Bollier 2010). Although there is a lack of common agreement and clarity concerning the definition of "big data" (cf. Boyd and Crawford 2012; Ukkonen 2013), the term has been widely adopted by media and journalism scholars (Couldry and Powell 2014; Lewis 2015; Lewis and Westlund 2015). Large digital data-sets used in data journalism are also referred to as "open data" that "can be freely used, re-used and redistributed by anyone" (Coddington 2015; Mair et al. 2013; Open Knowledge Foundation 2012; Parasie 2015). Conventional journalism research benefits from using "big data" as a replacement for traditional representative sampling (Couldry and Powell 2014, 1). The current trend towards ubiquitous and mobile communications naturally increases the variety of "big data". The challenge is how to channel these data streams into knowledge, journalism or research (Lewis and Westlund 2015, 4).
The objective of this paper is to demonstrate the building procedure and possibilities of a new computational approach for gathering and processing online "big data" and to show how it can be used to detect issue-attention cycles and news flows in Facebook groups and pages. We use a case example from the Finnish Facebook community page "Valio Out of Fennovoima's Nuclear Plant Project" (2014). This is a group protesting against co-operation between the food-producing company Valio and the nuclear power company Fennovoima in building a new nuclear power plant in Finland. Facebook has, in Finland, become the main channel of the public's online activities and news consumption. A substantial majority of the population (about 86 per cent) regularly uses the internet, and Facebook is used by 95 per cent of all Finnish SNS users (Statistics Finland 2014). The case, as a topical issue in Finnish public debate, is particularly interesting from the journalism study perspective and will be used to detect how data gathered from Facebook can be used to compare and find synergy between the public's attention waves and the media's influence and operations.

Online News and "Issue-attention Cycles"
Online news is characterized as a revolutionary hybrid news medium (e.g. Allan 2006; Kautsky and Widholm 2008; Widholm 2015), which affects the role of journalists as mediators and moderators of information (Hermida et al. 2012) as well as changing the audience's behaviour. Social media platforms, such as Facebook and Twitter, form a digital news hybrid with traditional news media services. Facebook is acknowledged as one of the primary vehicles for news flows and exposure to news (Baresch et al. 2011; Bell 2014). News, via Facebook, is nowadays a mixture of Facebook users' posts, Facebook groups' and (fan) pages' posts, as well as advertisers' messages.
Consequently, in contrast to passive audiences, people can now also be considered mediators and gatekeepers of news online. However, the audience's attention cannot be taken for granted. "Instead of a traditional push-model, users are free to navigate between sites to seek the information they desire and select their own versions of the daily news" (Weber and Monge 2011, 1063). Thus, in contrast to the traditional procedure of newspapers gathering their audiences, Baresch et al. (2011, 2) refer to "a new kind of news consumption strategy, a new kind of consumer, a 'stumbler' so to speak, who gets nearly all his or her news through incidental or socially selected exposure". Online news sites and portals fiercely compete to be the quickest at catching the audience's attention, thus publishing breaking news almost in real time. The news criteria of traditional media largely become irrelevant, since the virtual space is practically unlimited. Social media largely serves as a distributor of news, but also as an independent news source. The dissemination of issues is unpredictable and uncontrolled, as everyone can at any time "share" a link, or news, to any number of SNS and other platforms. In addition, algorithms produce and distribute news online that can also be redistributed by social media users.
Moreover, the attention span of the media and the public on an issue is not unlimited. The cyclical character of public attention was noticed already in the pre-digital media era. More than 40 years ago, Anthony Downs outlined a concept of "issue-attention cycles" characterizing the rising and fading of public attention and concern towards major societal issues (Downs 1972). Downs (1972, 39-40) suggests five stages of "issue-attention cycles", characteristic of the American society and media of the time: (1) the pre-problem stage, where only some experts or interest groups are aware of the problem; (2) alarmed discovery and euphoric enthusiasm, where the public becomes both aware of and alarmed about the evils of a particular problem; (3) realizing the cost of significant progress, which includes major sacrifices by large groups of population; (4) gradual decline of intense public interest and enthusiasm; (5) the post-problem stage, where the issue moves into a "twilight realm of lesser attention or spasmodic recurrences of interest". In the digital environment, issues rise and decline rapidly, but the attention cycles are not necessarily shorter than in traditional media as Anderson, Brossard, and Scheufele (2012) demonstrate.
In the pre-internet age, the traditional media governed the issue-attention cycle. The structure of the news flows in the traditional media is based on newsworthiness and the space the topical stories get in print and broadcasting outlets. However, before the era of computational approaches and data-driven methods, detecting and explaining the cyclical nature of public attention and major issues, which appear in the combined news flows of the digital news ecosystem, was not feasible.

Approaches to Computational Data Gathering for Research
Using semi-public data (data collected with a user account) for Facebook group and page content analysis with a computational approach such as the one presented in this paper is still very rare in the social sciences. Our approach is a tool that uses Facebook's own application programming interfaces (APIs) to gather all available communication activity data from the platform's pages and groups and organizes it into a warehouse. Semenov (2013) discusses many aspects of social media data analysis, implementation and a repository designed for monitoring communities on social media sites (see also Semenov, Veijalainen, and Boukhanovsky 2011). In their working paper, Zlatanov and Koleva (2014) use a software application called Opinion Crawler, designed to extract data from open Facebook groups, for data analysis through people-centric models and text network analysis connected to protests originating online. Nevertheless, the objectives of these studies are quite different from this paper's: they do not explain the software design specifics or the data organization procedures for Facebook data per se, and they are not conducted from a journalism studies perspective.
Indeed, many Facebook data collection applications and technical platforms are available online for researchers (e.g. Digital Footprints, NodeXL, Netvizz, RFacebook, SocialMediaMineR and Facepager). For example, Netvizz (2015) is a ready-made application tool for retrieving data from Facebook, which asks for target users' permission to access their public profiles and friend lists. Though quite similar in providing lists of posts and their likes and comments, in contrast to our approach it relies on the researcher being a member of a group or liking a page, and it depends on the tool creator's decisions on output data. The tool provides fewer data and analysis possibilities compared with independent data retrieval and building one's own data warehouse.
Another tool, more technical and nearly identical to ours in its data retrieval method, is Facepager (2015), which fetches publicly available data from Facebook, Twitter and other platforms with an open standard format for transmitting data (a JSON-based API) and stores it in a database. This tool can gather all the Facebook data, but in an unorganized form. Our approach is to transfer the data into the data warehouse in a specific model schema, which helps to organize and analyse the data.

Facebook and Data Collection
Privacy settings greatly impact the results of data gathering on Facebook, especially the information the user has decided to hide from others (Giglietto, Rossi, and Bennato 2012). For example, analysing newsfeeds on Facebook could turn out to be more difficult than originally expected, due to the general practice of only sharing the timeline with a limited list of "friends". Instead, focusing the research on Facebook pages could improve the results of data collection, because Facebook pages are regarded as public material with no limitations on their internal data (Giglietto, Rossi, and Bennato 2012, 152).
While conducting research on Facebook, it is important not to violate the common principles of the site. The general notion in Facebook's "Principles" and "Statement of Rights and Responsibilities" (SRR) about the available data focuses on the possibility of users individually determining the information they are willing to share publicly. If users do not limit the availability of information, it becomes public data, and Facebook is not responsible for how information received from the site is used (see Facebook 2014a, 2014d).

Generally, access to SNS data can be categorized into three types: public data, semi-public data and dark data. Public data can be retrieved from public interfaces without signing a user agreement with a service provider. Semi-public data can be accessed from public interfaces by signing a user agreement and using the user account to retrieve data. Dark data are obtained with techniques that the service provider has not intended for use or that are against the user agreement of the service. For example, Facebook provides more data in its Web user interface than from its APIs. This difference has given rise to software tools like "scrapers" that, instead of using machine interfaces, use interfaces meant for human users to pool more detailed information from the network. Other "dark" tools, like harvesters, robots and spiders, gather data blindly in an attempt to remodel the social network. Facebook's SRR and "Automated Data Collection Terms" (ADCT) dictate the terms of using automated tools, such as harvesting bots, robots, spiders or scrapers: they indicate a need to ask Facebook's written permission for using such tools or storing data, and forbid using any acquired data for business or advertisement purposes (Facebook 2014d). Operators of Facebook's own Platform applications or websites, and users of Social Plugins, must comply with the "Facebook Platform Policy" (FPP) (Facebook 2014g).
Below we discuss the collection of semi-public data from users and communication data from Facebook open groups and pages, and use the same data from a page in the case study analyses. During the creation of the data-gathering techniques, special emphasis was put on acting in accordance with Facebook's user agreements, principles, and any intent implied in the SRR, FPP and ADCT. Particular attention was paid to developer rules, protection of data, IDs and not selling or reproducing the data, applicable to our computational approach of semi-public data gathering, as we are not developing a specific platform app or website nor using the indicated automated data collection tools.

Building Data-centric Research Approaches for Studying Facebook
We describe here the path taken to create a research tool for studying Facebook. Facebook currently offers a readily available user interface data tool and a simple content search system, the Graph API Explorer. It is a software environment created for third-party developers, where it is possible to create applications that access data on Facebook by asking permission from users (Facebook 2014c). Instead of using the Graph API Explorer, we gathered semi-public data from pages and groups by using Facebook's own APIs and public interfaces, which allowed broader freedom of research and more data to be obtained, while still following the general (developer) principles and regulations of Facebook.
We will initially define key concepts and motivations behind using computational data gathering, then explore the possibilities of data organization, warehousing and analysis, and in the following section, demonstrate some basic case examples of the application of research data in studying issue-attention cycles and news flows in Facebook pages. While describing the building process of the tools, we have focused more on concepts and problem solving than on a direct hands-on approach. Our understanding is that conducting data-centric research and building computational tools is not purely technical work; it is more about thinking about what data we can obtain and what its meaning is from the point of view of research.

Automatic Versus Manual Data Gathering and Some Basic Concepts
The biggest drawback of manual data gathering is that the method is slow and prone to human error. The Facebook page used in this article as a case example had 966 posts in its feed and 1921 comments attached to these posts. In addition, the manual method makes only a limited portion of the data available. While Facebook shows the key information in a Web or mobile client, more data are available from the system's APIs, such as metadata on how the system handled the content in question. Furthermore, during lengthy manual data gathering, the information available can change or become unavailable. Although computational data gathering does not completely remove the risk of data being changed or going missing, it minimizes the window of opportunity in which this can happen.
For an avid user who has used Facebook's public Web or mobile client, the concepts, terms and scopes might be self-evident, but not necessarily from a technical point of view. Depending on which API is in use, there are various terms for the underlying data, and restrictions that are invisible to users browsing the service. Some of the main concepts are explained in Table 1.

TABLE 1
Main concepts

Like: Action that a user can make to notify the creator of a post or a comment that the user has liked the entry. Everybody who can see the original entry can also see all "likes".

Share: Act where a user shares a post generated by someone else on their own feed, or posts it to a feed of a friend, group or page. Information about shares on Facebook is restricted. If a user has permission to read the entry, it is possible to read a list of users who shared the article if their privacy settings allow this.

User: Account of an individual user. Detailed personal information on Facebook is restricted. If the user is not a friend or has not chosen to relax their privacy settings, the only information that can be retrieved about the user is first name, last name, gender, age, friend count and subscriber count.

Searching and Retrieving Data from Public Interfaces
Both public interfaces of Facebook, the Graph API and FQL (Facebook Query Language), enable keyword searches of selected data in the system. Currently, in version 2.2, it is possible to use the Graph API to make keyword searches on the names of users, pages, events, groups, places and locations (Facebook 2014e). Previously it was also possible to search posts by keyword, but unfortunately this feature is being removed from the system with the introduction of version 2.2 and the deprecation of version 2.0 of the platform. With the same upgrade, Facebook is also removing support for FQL (Facebook 2014b).
After a search is made (for how, see Facebook 2014e), the system returns a list of matching results. Only users who have not removed themselves from public search are reached. When a researcher finds an interesting source of data and the data are publicly available, the data are retrieved by issuing calls to different endpoints of the Graph API.
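As a sketch of what such a call looks like, the snippet below builds the request URL for one page of a feed via the Graph API (version 2.2, as discussed in the previous section). The helper name, page ID and token are illustrative; the actual HTTP GET, and following the paging cursor returned in each response, are left out:

```python
from urllib.parse import urlencode

# Graph API version matching the one discussed in the text.
GRAPH = "https://graph.facebook.com/v2.2"

def feed_url(page_id, access_token, limit=100):
    """Build the URL for retrieving one page of a feed from the Graph API.

    The caller still has to perform the HTTP GET and follow the
    'paging.next' cursor in each JSON response to walk the whole feed.
    """
    query = urlencode({"access_token": access_token, "limit": limit})
    return f"{GRAPH}/{page_id}/feed?{query}"
```

The same pattern applies to other endpoints (e.g. `/comments` or `/likes` under an individual post ID), only the path after the object ID changes.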
There are, however, a few restrictions on which data can be retrieved. First of all, if the researcher is not the owner or administrator of a feed, no data about who has liked the page or group is available. Secondly, the biggest restriction is the API call limit. Currently Facebook only allows 600 calls per 600 seconds per authentication token, i.e. in simple terms one call per second can be issued to Facebook (see Mangobug 2012). While this restriction, for a human user, would not be a limiting one, for a computational data search and retrieval system this is a major hindrance and limitation. Also, while it takes only around 40 seconds to retrieve 1000 posts from a feed, it will take much more time to get other data, such as comments on the posts. In other words, while access to basic items is fast, being able to gather all data takes much longer. Thus, when designing an application to search and retrieve data, researchers should make an effort to classify the kinds of data they want before embarking on building a data storage and warehouse tool.
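A retrieval tool therefore has to throttle itself against the 600-calls-per-600-seconds window described above. One possible sliding-window limiter is sketched below; the class and method names are ours, not part of any Facebook tooling:

```python
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls=600, period=600.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of calls inside the current window

    def delay_before_call(self, now):
        """Return how many seconds to wait before the next call at time `now`.

        Returns 0.0 when the window still has room; otherwise the time until
        the oldest call in the window expires. The slot is reserved either way.
        """
        # Drop timestamps that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return 0.0
        wait = self.period - (now - self.calls[0])
        self.calls.append(now + wait)
        return wait
```

In a real retrieval loop one would call `time.sleep()` on the returned delay before each Graph API request, keeping the tool safely under the token's quota.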

Push-stream and Pull-stream Views on Data Retrieval
Facebook's owners have access not only to what people do on its pages but also to information about who has viewed what and for how long. Such an overall datascape could be called a "whirlpool", where events external to the social network create internal actions in the system, which can in turn create actions outside the network.
The "whirlpool" can be further broken down into two different ways of looking at retrievable data from Facebook. When a user makes a post, comment, like or share, they actually generate a messaging event that initiates different actions in the system. This view of retrievable data could be called the push-stream: user-generated events push the system to perform different actions. Another view of Facebook would be constructed on the basis of what it messages to a user. Individual users have their own event wall or feed that is filled by Facebook with content generated by a particular user's network of friends and with advertisements displayed by the system itself. This view of the retrievable data could best be described as the pull-stream, where individual users pull content from the system by browsing the content of their own or other users' walls.

However, Facebook sets restrictions on what it is possible to retrieve from the system. Logged-in users can only access their own feeds, the feeds of their own network, the feeds of other users who have set their feeds to public, and the feeds of open groups and pages. In addition, Facebook does not give any information about individual visits to the content of a feed, except to administrators of pages. These restrictions alone make it impossible to retrieve pull-stream data. Furthermore, Facebook does not give detailed information about "shares" and "likes". In the case of "shares", information about who shared and to whom is missing; the only statistic the system gives out is the total number of "shares". In the case of "likes", the system does not give the time of an individual "like". These restrictions hinder retrieving push-stream data, but do not make it completely impossible.
In conclusion, of the data that Facebook provides to third parties, the push-stream forms the most complete retrievable data. The pull-stream-based view is currently impossible to reconstruct, and the aggregated level is still heavily restricted. A limited "whirlpool" view could be constructed by combining push-stream data with aggregated pull-stream data and with external sources, such as traditional media. However, the recommendation based on our experience is to concentrate on retrieving data from the push-stream, looking at actions and their actionable effects.

The Necessity of Clarifying Data and Creating a Data Warehouse
When testing the computational data retrieval approach, we noticed multiple practical problems. Firstly, social networks, like other IT systems, are under constant change: new features are developed and old techniques are terminated. Alongside changes in the official APIs, other changes, e.g. to the privacy policy, could alter the data available. Thus, the research subject is a moving target that can suddenly change, in the worst case denying access or changing key functionalities and affecting the research results.
Another problem is that the APIs are intended to extend the functionality of their respective services. They are designed not for data analysis but for day-to-day system operations. For example, each "post" retrieved via the Graph API has 28 attributes that describe it and its relationship to other items and system metadata. There are also 31 endpoints overall in the system from which different data can be retrieved. Thus, there are too many different data objects and attributes to handle without any categorization or connection clues.
These problems evoked the idea that the data retrieved should be logically separated and isolated from the data format provided by the source system, and when moved into a different system for analysis, the data should be coded into a simpler form. Thus, a separate data warehouse was coded and constructed to manage the data.
A data warehouse in computing refers to a system that is used for saving data from one or multiple source systems into a single set for data reporting and analysis. There are two different design philosophies regarding data warehousing. The first is the dimensional approach of Ralph Kimball's star schema, where measurable and quantitative data are stored in fact tables, whereas descriptive attributes related to the fact data are stored in dimensional tables (Kimball 1996). The second design approach is Bill Inmon's 3NF (Third Normal Form) model, where data are structured as much as possible in order to minimize data redundancy through normalization (Inmon 1992). It is also possible to combine the two models. Hence, we decided to combine the idea of the star schema (storing the data in separate tables) with data normalization (simplifying and combining similar forms of data into the tables), thus forming a data model that is easy to search and analyse.
The data warehouse of this study was built on the basis of a push-stream-based event data-centric model modelled into a star schema, but structured and normalized as much as possible. "Event" was defined as a time-associated transaction in the system. Due to Facebook's restrictions on accessing data, only "posts" and "comments" could be defined as "events", and "likes" and aggregated data about "shares" were used to describe "events" to which they were tied. The data model used was based on a dimensional approach where "posts" and "comments" would be stored as "events" and all the other information describing and relating to them stored to dimensional tables (see Table 2; Figure 1).
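To make the star-schema-with-normalization idea concrete, the five-table model can be sketched in SQL (here via Python's built-in sqlite3 module). The five table names are those of the study; the column names within them are our illustrative assumptions, not the study's exact schema:

```python
import sqlite3

# One fact table (f_events) surrounded by dimensional tables (d_*),
# with likes kept in a separate relation table (r_likes).
SCHEMA = """
CREATE TABLE d_entity     (entity_id  INTEGER PRIMARY KEY,
                           first_name TEXT, last_name TEXT, gender TEXT);
CREATE TABLE d_event_type (type_id    INTEGER PRIMARY KEY,
                           name       TEXT);   -- 'post' or 'comment'
CREATE TABLE d_content    (content_id INTEGER PRIMARY KEY,
                           message    TEXT, link TEXT);
CREATE TABLE f_events (
    event_id    INTEGER PRIMARY KEY,
    ts          TEXT,                                   -- event timestamp
    type_id     INTEGER REFERENCES d_event_type(type_id),
    entity_id   INTEGER REFERENCES d_entity(entity_id),
    content_id  INTEGER REFERENCES d_content(content_id),
    parent_id   INTEGER REFERENCES f_events(event_id),  -- comments point at posts
    share_count INTEGER       -- aggregate only: per-share data is unavailable
);
CREATE TABLE r_likes (event_id  INTEGER REFERENCES f_events(event_id),
                      entity_id INTEGER REFERENCES d_entity(entity_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Note how "likes" describe events through the `r_likes` relation rather than being events themselves, mirroring the dimensional decision described above.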
The technical reason not to store "like" data to an "event" table was to make it smaller and simpler. Especially in feeds of groups and pages, every "post" and "comment" has multiple "likes" and thus, the size of the event table would grow immensely. Queries would also become more complex as "likes" would be treated as child events of "posts" and "comments" versus a direct relation linkage via the "like" table.
In addition to simplifying the form of the data, clarification and warehousing also enable using a single set of analytical tools to address multiple social networks via a common data warehouse. Although data from different social networks differ, their functionality in the conceptual level is similar (see Table 3). By identifying similarities between social networks, data from multiple systems can be brought into a single system and transcoded into a single format, enabling researchers to build and use exactly the same analytical tools for both networks.
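A minimal sketch of such transcoding might map raw API items from two networks into one common event form. The input field names below follow each platform's public API; the common form and the function name are our illustrative choices:

```python
def to_event(item, network):
    """Transcode a raw API item from either network into a common event form."""
    if network == "facebook":
        # Graph API post fields: id, created_time, from{name}, message.
        return {"id": item["id"], "ts": item["created_time"],
                "actor": item["from"]["name"], "text": item.get("message", "")}
    if network == "twitter":
        # Twitter REST API tweet fields: id_str, created_at, user{screen_name}, text.
        return {"id": item["id_str"], "ts": item["created_at"],
                "actor": item["user"]["screen_name"], "text": item["text"]}
    raise ValueError(f"unknown network: {network}")
```

Once both networks speak the same event form, a single set of warehouse loaders and analytical queries serves them all.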

Possibilities of Analysing Data from the Warehouse
After these decisions and the organization of data, the warehouse stores the data in five tables (see Figure 1): f_events, d_content, d_entity, d_event_type and r_likes. Of these, the d_content and d_event_type tables describe the content of an event, together forming the content-centric type of data. The r_likes and d_entity tables deal with actors and their relation to events and each other, forming the people-centric type of data. The last type of data is event-based, consisting of "posts" and "comments" made in a feed.
The content-centric data are the most unrestricted of the three types. Content in general can be retrieved without hindrance. Analysis of the content of texts, photos and videos is still the major domain of qualitative research. Content can indeed be analysed automatically, for example with sentiment analysis using word lists to score the tone of a text, but these techniques are still under development and their usefulness is debated. However, the most useful function of storing the content along with other feed information is the ability to get the content out quickly and with additional information, such as time and date, number of likes, comments, users tied to a specific content, and so forth.
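As a toy illustration of the word-list technique mentioned above, a score can be computed by counting matches against positive and negative lists. The lists here are illustrative stand-ins, not a real sentiment lexicon and not ones used in the study:

```python
# Illustrative word lists; a real analysis would use a validated lexicon
# in the language of the feed under study.
POSITIVE = {"good", "great", "support", "thanks"}
NEGATIVE = {"bad", "against", "protest", "risk"}

def sentiment(text):
    """Crude word-list score: positive hits minus negative hits."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

Even this crude scoring shows why such techniques are debated: negation, sarcasm and context are all invisible to a bag-of-words count.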
The people-centric data are a rather limited type, not only due to Facebook's restrictions on personal data, but because the user information available is tied to user-generated events. If a person has read a post or a comment but not performed any other activity, there is no trace of that user. Thus, the user information is limited to people who were active in a feed at least once. The information always available about users includes first name and last name, the language version used and the user's profile picture. Gender information is available in 99.9 per cent of cases, friend count in 54.7 per cent and affiliation information in 2.9 per cent of cases. This is essentially all the information that Facebook gives about users directly, but it does enable making comparisons based on gender and on the number of friends a user has.

The event-centric data enable more data to be gathered and analysed. In the context of a feed, one can track a user's total number of "posts", "comments", "likes" and "shares", as well as the posts and comments they have commented on, liked and shared. Further information can be generated by taking into account the type of content and the time of the post or comment in question. The problem with the event-centric data, as noted before, is that Facebook does not give information about individual shares or about the time when a "like" was made. Thus, analysis of the event data is analysis of actions and their responses. For example, a "like" is always a response, while a "comment" and a "post" can be both actions and responses to other posts and comments. With this data, we can generate a view of what has happened and what the responses to it are, and by combining this data with data about content and users, we can start to explain the behaviour of a feed.

Using the Computational Approach in Journalistic Research: Case Facebook
To demonstrate the potential of Facebook data and its organization for journalistic research, we chose the page of a protest group called "Valio Out of Fennovoima's Nuclear Plant Project" (2014) with 3095 followers. The page was found on Facebook with the Graph API, all its available semi-public data were retrieved and directed to the organized data warehouse, which automatically saved the data to the assigned tables.
In the following, we use the concept of the "issue-attention cycle" for demonstrating the potentials and possibilities of computational data gathering. Our aim is not to attempt to explain exhaustively the ebb and flow of public attention on the issue of building a nuclear power station in Finland and all the activities of the respective Facebook group. Instead, we are trying to show how to discover and visualize the waves of attention using large data-sets, and indicate some possible basic ways of analysing them.

Detecting Issue-attention Cycles
Groups and pages formed to support or protest against an issue make it possible for journalism and media researchers to observe issue-attention cycles of certain societally topical issues. Our example consists of the above-mentioned group's data since its birth, from week 32 of 2011 to week 45 of 2014.
Focusing on the page's event-centric data of "posts" and related communication activity, such as shares, likes and comments, it is possible to observe the intensity of the group's attention. Table 4 shows an example data table that contains weekly downloads of the page's "posts" and post-related activities. It is also possible to take a daily download of "posts" for even more specific cycle evaluations. Table 4 first shows the year, month and week of the "posts", then the total number of posts grouped by time and by their producer: post by source indicating the page/administrator of the page as the source, post by other indicating any other actor as the source, and total number of posters, i.e. the number of individual producers of the posts. The table also shows the same totals and grouped information for post-related shares, likes, comments and comment likes.

By categorizing the activity information according to each producer, one may find some interesting aspects. For example, in this short data excerpt we can see that in total 103 "posts" were made on the page wall, most of them by the page/administrator, i.e. the source (N = 89, 86 per cent). In addition, there are no shares made by the group "members" (i.e. other), only by the page, and in large quantities, and nearly the same applies to "likes". This may give an initial indication of the group's internal dynamics and objectives.
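A weekly table like Table 4 can be produced directly from the warehouse with a grouped query. The sketch below builds a tiny in-memory stand-in for the event table (the column names are illustrative, and the three sample rows are invented) and aggregates it by week:

```python
import sqlite3

# Minimal stand-in for the warehouse's event table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_events (ts TEXT, kind TEXT, actor TEXT)")
conn.executemany("INSERT INTO f_events VALUES (?, ?, ?)", [
    ("2014-03-17", "post",    "page"),   # invented sample events
    ("2014-03-18", "post",    "other"),
    ("2014-03-18", "comment", "other"),
])

# Group events by ISO-style year-week, counting posts, comments and
# distinct producers, as in the weekly download described above.
rows = conn.execute("""
    SELECT strftime('%Y-%W', ts) AS week,
           SUM(kind = 'post')    AS posts,
           SUM(kind = 'comment') AS comments,
           COUNT(DISTINCT actor) AS posters
    FROM f_events
    GROUP BY week
    ORDER BY week
""").fetchall()
```

Swapping `%Y-%W` for `%Y-%m-%d` in the `strftime` call yields the daily download mentioned above instead.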

From the issue-attention cycle perspective, the total numbers of posts, shares, likes and comments (see Table 4) give a good overall picture of the communication activities of a Facebook group. Figure 2 visualizes the overall data-set with a specific focus on the total numbers of "posts" and their associated shares, likes and comments, summarizing the amounts at roughly fortnightly intervals over the years 2011-2014.
Looking at Figure 2, we can observe the issue-attention cycle in the social media context. The total data figure shows that attention was relatively steady and low immediately after the launch of the group (during 2011-2012), and that activity picked up only towards the end of 2013 and in 2014. The early phase of the group's formation may reflect the "pre-problem" stage, as the initiators of the group must have been people with enough information to be worried about the issue. Alongside the growing attention of the traditional media (towards the end of 2013), when Fennovoima made a deal with the Russian nuclear energy corporation Rosatom (Taloussanomat 2013), the group's activities reached the "alarmed discovery" stage. Activity increased even further at the beginning of 2014, when media attention focused on the fact that Fennovoima's plant project had, due to funding problems, been transformed into a Russian project (e.g. YLE 2014). During the "alarmed discovery" stage, the group actively shared information, which in turn attracted large quantities of "shares" and "likes".
To explain exhaustively the actual reasons for the activity peaks, their relation to the public agenda and the group's inner development, qualitative analysis is also necessary. A fruitful way of starting qualitative analysis is to retrieve content-centric data of the posts from the warehouse for content analysis. This would, for example, reveal the specific content of the posts and comments that created the high activity peaks of weeks 12-14, 2014. The quantitative data enabling activity visualization is, nevertheless, a good foundation for any further analysis.

News Flows as Escalators of Attention
The communication of groups and pages on Facebook includes links to online news articles and to the content of other social media and websites, as well as other users' comments, shares and "likes" on those links, all of which are components of social media's news flows. By examining these news flows, for example by tracking articles on particular topics, changing patterns of news consumption and production can be described, and issue-attention cycles explained.
Focusing on the content-centric data of the links attached to the "posts" shows what news articles or other content, and from which sources, have been linked. In addition, the event-centric data of related comments, likes and shares show how the news links generate activity, escalating their impact and creating new news flows. Table 5 shows a snapshot of one way of retrieving and organizing such a data table from the warehouse. The complete data table includes 121 link sources retrieved from the group's wall since the group was formed, from week 32 of 2011 to week 45 of 2014.
More specifically, looking at the first source column row, "www.facebook.com", the second column, referred, indicates the total number of times a link from the source "www.facebook.com" has been posted on the page (N = 222). The third column, referrers, indicates the total number of actors posting such links (N = 9), and the next three columns categorize the referrers by type of producing actor: users, i.e. individual people (N = 6), Facebook page (N = 3) and Facebook group (N = 0). The data table's final six columns show the like count, i.e. the total number of likes on the "Facebook" source links (N = 3661), their total share count (N = 1279) and total comment count (N = 1038), and the average number of times each link has been liked, shared and commented on.
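The construction of such a Table 5-style summary can be sketched as a per-source aggregation. The link records and their field names below are hypothetical stand-ins for the warehouse output, not the paper's actual schema:

```python
# Illustrative link records retrieved from the warehouse; field names
# and values are assumptions for this sketch.
links = [
    {"source": "www.facebook.com", "referrer": "user_a", "referrer_type": "user",
     "likes": 40, "shares": 10, "comments": 5},
    {"source": "www.facebook.com", "referrer": "page_x", "referrer_type": "page",
     "likes": 20, "shares": 6,  "comments": 3},
    {"source": "yle.fi",           "referrer": "user_a", "referrer_type": "user",
     "likes": 15, "shares": 2,  "comments": 1},
]

def link_source_table(links):
    """Aggregate posted links per source domain: how often each source
    was linked (referred), by how many distinct actors (referrers),
    plus activity totals and per-link averages."""
    table = {}
    for l in links:
        row = table.setdefault(l["source"], {
            "referred": 0, "referrers": set(),
            "likes": 0, "shares": 0, "comments": 0})
        row["referred"] += 1
        row["referrers"].add(l["referrer"])
        for k in ("likes", "shares", "comments"):
            row[k] += l[k]
    # Freeze distinct-referrer counts and derive per-link averages.
    for row in table.values():
        row["referrers"] = len(row["referrers"])
        for k in ("likes", "shares", "comments"):
            row["avg_" + k] = row[k] / row["referred"]
    return table

table = link_source_table(links)
```

The `referrer_type` field could further be tallied into the user/page/group breakdown shown in Table 5 with one more counter per row.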
The data output includes detailed producer information, which offers interesting perspectives for interpretation, as well as the basic totals of posted "links" and their like, share and comment counts. The overall data offer numerous possibilities for counts and correlations. For example, one simple option is to start by counting news source links in comparison with other link sources (e.g. entertainment) to evaluate media influence, proceeding to a comparison of different media sources. Our study's results show that the most linked source on the page was Facebook itself (N = 222), the second most used source was YLE (the Finnish national public service broadcasting company) (N = 57) and the third was Kaleva (a daily newspaper) (N = 40). Interestingly, Helsingin Sanomat, Finland's most popular and authoritative quality newspaper, appears only as the fifth source.
In addition, by focusing on the total amounts of attention and activity the posted links have generated, i.e. likes, shares and comments, one can measure the general impact of the links. Figure 3 shows the four most linked news sources and their escalated activity. One could continue with more complex analyses and models of how the volume and content of posted news links affect the attention they gain among the public (likes, comments) and their forwarding to new publics (shares).
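A Figure 3-style impact ranking amounts to summing likes, shares and comments per source and sorting. The sketch below uses the Facebook totals reported above for "www.facebook.com"; the totals for the other two sources are placeholders, not the paper's figures:

```python
def rank_by_impact(table, top_n=4):
    """Rank link sources by total escalated activity
    (likes + shares + comments), highest first."""
    impact = {src: row["likes"] + row["shares"] + row["comments"]
              for src, row in table.items()}
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# The www.facebook.com totals come from the study; the others are
# illustrative placeholders.
table = {
    "www.facebook.com": {"likes": 3661, "shares": 1279, "comments": 1038},
    "yle.fi":           {"likes": 900,  "shares": 300,  "comments": 150},
    "kaleva.fi":        {"likes": 500,  "shares": 120,  "comments": 80},
}
ranking = rank_by_impact(table)
```

Weighting shares more heavily than likes would be one way to model "forwarding to new publics" as a distinct kind of impact.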
The described approach offers various possibilities for setting the research focus and analysing the data. Quantitative analysis can be combined with qualitative analysis, for example, by retrieving content-centric data with a focus on the time and full content of the links, as the links can be opened and qualitatively analysed. When looking at cycles of issue attention, one might, for example, choose links from a high-attention period, study their journalistic framing and narratives, and compare them to low-attention periods.

Conclusions and Discussion
This paper reflects the paradigm shift in social scientific research towards a computational approach to data gathering and analysis. Our main conclusion is that the use of innovative internet research techniques and the analysis of large data-sets in studying digital and social media enormously widens the scope and quality of the information that social scientists can have at their disposal. The traditional methods of social sciences, such as surveys, various forms of text analysis, interviews, etc., are not sufficient for studying the new kinds of information streams of the hybrid digital news ecosystem that combines online media, social media and other digital sources of information and news. They also remain limited for researching the consumers and producers acting in this ecosystem, as these roles, too, become increasingly combined (often called "prosumers"). The changing subjects and foci of research also require enlarging the variety of information available to researchers.
In this article, we introduced a new option for gathering and processing large data-sets for studying attention and information flows on SNS, specifically Facebook. Much of the data Facebook contains is not freely accessible, but there are substantial amounts of open data, for example on Facebook pages and in open groups. One way of computationally accessing such data is to gather semi-public data using Facebook's own APIs and public interfaces. Compared with specific permission-based applications or other ready-made online tools, this method allows more data to be obtained and more freedom in organizing them. In our approach, we emphasized isolating the data from the source system and coding them into a simpler form in a separate data warehouse. Before building a data warehouse model, it is nevertheless important to understand the aspects of the available data and their connections, which are presented in this paper. Understanding the technical aspects is also helpful in planning data retrieval and analysis (beyond the case examples of this paper).
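The gathering-and-coding pipeline described above can be sketched minimally: paginate through a page's public posts via the Graph API and flatten each raw post into a simple warehouse row. The endpoint path, field list and API version reflect the Graph API as publicly documented around the time of writing; they, the flat row schema and the helper names are assumptions of this sketch, a valid access token is required, and the API has changed repeatedly since:

```python
import json
from urllib.request import urlopen

# Graph API version current around the time of the study; an assumption.
GRAPH = "https://graph.facebook.com/v2.5"

def fetch_page_posts(page_id, token):
    """Yield all posts of a public page, following pagination cursors.
    Field availability depends on the API version and the token's scope."""
    url = ("%s/%s/posts?fields=id,message,created_time,"
           "shares,likes.summary(true),comments.summary(true)"
           "&access_token=%s" % (GRAPH, page_id, token))
    while url:
        page = json.load(urlopen(url))
        yield from page.get("data", [])
        url = page.get("paging", {}).get("next")  # None ends the loop

def flatten(post):
    """Code one raw API post into a flat warehouse row
    (the row schema here is illustrative)."""
    return {
        "id": post["id"],
        "created": post.get("created_time"),
        "shares": post.get("shares", {}).get("count", 0),
        "likes": post.get("likes", {}).get("summary", {}).get("total_count", 0),
        "comments": post.get("comments", {}).get("summary", {}).get("total_count", 0),
    }
```

Isolating the raw JSON from the flat rows in this way is what keeps later analyses (such as the weekly and per-source tables above) independent of changes in the source platform.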
The case example analysis in this paper was kept quite simple and limited to the "issue-attention cycle" of one topical issue in the Finnish public debate and the link flows on the group page. The analysis showed how large data-sets can be used to present time-scales of high and low waves of online activity and attention on an issue. By comparing these to actual societal events and to established media content analysis, various cycles and explanations of issue attention can be outlined. In addition, the process allows news flows on social media and their impact among the public to be tracked, both quantitatively, by comparing how much attention shared news gets online, and qualitatively, by analysing, for example, comments on shared news articles. Which news gains the public's attention, and how it becomes forwarded in social media, provides journalists and media houses with valuable information about online news consumption and news flows.
The same data-gathering and processing technique can be used for studying various other journalistic aspects both quantitatively and qualitatively, including the integration of other SNS. For example, comparing traditional media content analysis with the online public's attention to and framing of societal or political issues may give new insight into who sets the agenda in today's society: online publics or the media? In addition, the data can be used to detect how the media use online topics as sources, compared with how the online public finds its topics in traditional and online media.
The ethical rules and norms for collecting and using online data in research remain under debate. However, by abiding by general laws and the rules of the platform or SNS under scrutiny, using case-by-case reflection, and securing the anonymity and safety of individuals and data, a researcher should be able to conduct ethically acceptable online research.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.