A social media-based dataset of typhoon disasters, 2017

Special Topic on Historical Datasets for Disaster Reduction Research · Zone II paper (published) · Version EN5, Vol 3 (2) 2018

Yang Tengfei, Xie Jibo, Li Guoqing

DOI: 10.11922/scdata.2017.0014.en
Dataset DOI: 10.11922/sciencedb.547

： 2017 - 12 - 14

： 2018 - 01 - 17

： 2018 - 05 - 14

Abstract & Keywords

Abstract: Typhoons are a category of natural disasters whose annual occurrence causes major life and property loss in the Northwestern Pacific region. During typhoon events, social media serve as an effective tool to transmit and acquire disaster information in real time. Texts and photos from social media can be used as a way of crowd sourcing to extract disaster loss information, analyze human behaviors and formulate responses. The dataset presented here consists of social media-based data collected from "Sina-Weibo" microblogs, "WeChat" articles, and "Baidu" news about the typhoon events in 2017, covering Typhoon "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar". We mainly collected text data from these social media platforms and websites, which were then cleaned for redundancy and irrelevance. This dataset can be used for deeper disaster information mining of typhoon events.

Keywords: typhoon; social media; disaster reduction; data mining

Dataset Profile

Chinese title	2017年台风灾害社交媒体数据集
English title	A social media-based dataset of typhoon disasters, 2017
Data corresponding author	Xie Jibo (xiejb@radi.ac.cn)
Data authors	Yang Tengfei, Xie Jibo, Li Guoqing
Time range	2017
Geographical scope	15°N – 30°N, 101°E – 132°E; specific areas include: southeast China and surrounding area
Data volume	1.70 GB (9749 texts from "Baidu" news and "WeChat" Subscription; 9601 records from "Sina-Weibo")
Data format	.html, .xls, .sql
Data service system	<http://www.sciencedb.cn/dataSet/handle/547>
Sources of funding	National Key R&D Program of China (2016YFE0122600); International Partnership Program of Chinese Academy of Sciences (131C11KYSB20160061)
Dataset composition	The dataset consists of two compressed (ZIP) files, "Data.zip" and "Classification example.zip". "Data.zip" contains eight subfolders: "Haitang", "Hato", "Khanun", "Mawar", "Merbok", "Nesat", "Pakhar", and "Roke". Social media data are stored in these subfolders in .html, .xls and .sql formats. "Classification example.zip" contains seven subfolders, one for each large category of disaster loss; each of these holds further subfolders for the small categories under that large category, with the data saved in XLS format. Data.zip: ● XLS file: structured texts from social media. ● SQL file: users can execute the SQL file in their own MySQL database to import the structured social media texts. ● HTML file: original web pages retrieved from "Baidu" news and "WeChat" Subscription. Classification example.zip: ● XLS file: disaster loss data; each file corresponds to a specific category of disaster loss.

1. Introduction

Typhoons cause major losses of life and property in the Northwestern Pacific region each year. Collecting information quickly and responding appropriately is an urgent problem for disaster relief departments. Crowdsourcing and citizen observation have proved effective for obtaining disaster information; social media in particular, such as Twitter,¹ Facebook,² and microblogs,³ provide near real-time information during a disaster. By making full use of the dynamic information collected from social media, disaster relief departments can obtain timely information about disaster events and people's responses to them. Research has been done on mining disaster information from social media data. Evidence shows that people's behavior is greatly influenced by social media when disasters occur.⁴ A study commissioned by the American Red Cross⁵ found that more than half of the respondents believed that government agencies should monitor social media to acquire timely and effective disaster information. As to how to mine valuable disaster information from social media data, Chae J et al.⁶ used Twitter data for hurricane disaster analysis, and the results supported government departments' policy decision-making. Some studies⁷,⁸ built disaster event classifiers based on microblog data, which detected disasters through citizen observation. In addition, progress has been made in the spatio-temporal analysis of disasters,⁹,¹⁰ the characteristics of social responses to disasters,¹¹ and the predictive simulation of disaster trends,¹²,¹³ which has greatly improved the efficiency of disaster relief.

Collecting useful information about disaster events from social media is time-consuming and complicated because the content is unstructured. Although some social media platforms provide an API (Application Programming Interface) for public information access, they also restrict the information that can be acquired. For example, microblog posts related to a specific disaster event, or posted during a specified historical period, cannot be retrieved directly through the API. In other words, the APIs of these platforms do not provide the corresponding retrieval functions, which increases the workload of subsequent data processing. Therefore, in our research project, we developed a toolkit to automatically harvest and process social media-based disaster information, and used it to generate a typhoon disaster dataset for 2017 from several social media platforms. The dataset is mainly composed of text data from "Sina-Weibo" microblogs, "WeChat" Subscription and "Baidu" news. Figure 1 shows typhoon disaster data from "Sina-Weibo". The data contain textual descriptions and pictures of the disasters, as well as the time and location of upload. They provide data support for disaster relief departments to follow the progress of a disaster in a timely manner.

Figure 1 Disaster information from "Sina-Weibo" microblogs

2. Data collection and processing

2.1 Overview

The dataset records information on the following eight typhoon events: "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar" (Table 1).

Table 1 The list of typhoons in 2017

No.	Name	Landfall time
1	Merbok	2017/6/12
2	Roke	2017/7/23
3	Nesat	2017/7/29
4	Haitang	2017/7/30
5	Hato	2017/8/23
6	Pakhar	2017/8/27
7	Mawar	2017/9/3
8	Khanun	2017/10/16

The data from "WeChat" Subscription and "Sina-Weibo" come mostly from unofficial media and public uploads, and mainly describe the progression of a disaster based on public observation. To give a more comprehensive picture of each disaster, we added data from "Baidu" news released by official media, which mainly contain disaster loss statistics, relief measures, etc. We used different methods for the different data sources. Keyword search was used to retrieve data from "WeChat" Subscription and "Baidu" news. For example, when "Typhoon Hato 2017" was entered, the "Baidu" search engine returned news related to "Typhoon Hato" in 2017. The toolkit we developed conducted the search and automatically retrieved the relevant content. We then parsed and cleaned these texts and stored them in the database in a structured form. The same method was used to obtain data from "WeChat" Subscription. For "Sina-Weibo", we used the advanced search function of the platform to obtain data related to the typhoon events. According to the tracks of the typhoon events (Figure 2), we used the name of each typhoon plus the characters "台风 (Typhoon)" as the keywords for setting retrieval conditions.

Figure 2 Tracks of the typhoon events in 2017

Source: "Tianditu" (http://map.tianditu.com/).

2.2 Data collection process

We developed a social media data harvesting system with functions for data collection, parsing, cleaning, and management, as shown in Figure 3. We acquired data from the different platforms using the collection module, and then parsed them into a structured form. The pages from "WeChat" Subscription and "Baidu" news were also stored in their original HTML format. Cleaning comprised removing duplicated information, converting traditional Chinese into simplified Chinese, converting full-width characters into half-width characters, etc. Finally, these data were stored in a structured form. The structure of the data is shown in Table 2.
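Two of the cleaning steps described above, width conversion and duplicate removal, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' toolkit: the function names `to_half_width` and `clean_records` are hypothetical, duplicates are assumed to mean texts that are identical after normalization, and the traditional-to-simplified conversion is omitted because it needs a conversion dictionary or a third-party library (for example OpenCC).

```python
def to_half_width(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width forms of '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_records(records):
    """Normalize width and whitespace, then drop empty and duplicate texts.

    The first occurrence of each normalized text is kept (an assumption;
    the source does not say which copy of a duplicate is retained).
    """
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(to_half_width(text).split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


sample = ["台风 Ｈａｔｏ ２０１７", "台风 Hato 2017", "  ", "台风　Hato　2017！"]
cleaned = clean_records(sample)
print(cleaned)
```

Normalizing before comparing means that two posts differing only in character width or spacing are treated as the same record.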

Figure 3 Flowchart of the social media data harvesting system

Table 2 Structure of the data

File (.zip)	Source folder	Typhoon subfolders	File formats	Notes
Data.zip	baidu	Haitang, Hato, Khanun, Mawar, Merbok, Nesat, Pakhar, Roke	.xls, .sql, .html	.html: Users can parse the pages themselves according to their research needs. .sql: Users can execute the SQL file in their own MySQL database to import the data. .xls: Users can use the data directly through the XLS file.
	wechat
	weibo		.xls, .sql

2.3 Data classification

Social media data contain a great deal of disaster loss information, and a single record may mention several types of damage. For example, a text from "Sina-Weibo" reads, "After the typhoon, many trees were blown down and many cars were smashed." This text contains loss information about the destruction of both trees and cars, and we assigned this information to different categories of disaster loss. Below we provide a classification example organized by the type of damage reported. The raw data in this classification example all come from "Sina-Weibo" microblogs related to Typhoon "Hato" in Zhuhai. Users can classify the rest of the data in the dataset by referring to the classification example or according to their specific research needs. The seven large categories are social effects, forestry, fisheries, traffic, electric power, communication and infrastructure damage. Each large category contains one or more small categories, as shown in Figure 4. For example, the category of social effects contains injuries and deaths, water shortage, building damage, and market shutdown. The classification example is shown in Table 3.

Figure 4 Category of disaster loss

Table 3 An example of disaster classification

Large category	Small categories	Number of posts
Social effects	Injuries and deaths	12
Social effects	Water shortage	258
Social effects	Building damage	78
Social effects	Market shutdown	3
Forestry	Destruction of trees and plants	119
Fisheries	Loss of fishing ground	1
Fisheries	Damage of fishing boats	1
Traffic	Traffic congestion	101
Traffic	Vehicle damage	38
Electric power	Electric power cutoff	287
Electric power	Damage of electric power equipment	4
Communication	Interruption of networks and signals	123
Infrastructure damage	Damage of street lamps, billboards, bridges, roads, and so on	34
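Because one post can report several kinds of damage, classification here is naturally multi-label. The sketch below shows one simple way to assign several (large category, small category) pairs to a single text by keyword matching. It is only an illustration: the classification example in this dataset was produced manually, the keyword lists are hypothetical, and English keywords are used for readability although the original posts are in Chinese.

```python
import re

# Hypothetical keyword sets for a few categories from Table 3.
# The source does not publish keyword rules; these are illustrative only.
CATEGORY_KEYWORDS = {
    ("Forestry", "Destruction of trees and plants"): {"tree", "trees", "branches"},
    ("Traffic", "Vehicle damage"): {"car", "cars", "vehicle", "vehicles"},
    ("Electric power", "Electric power cutoff"): {"blackout", "outage"},
    ("Communication", "Interruption of networks and signals"): {"signal", "network"},
}


def tag_post(text):
    """Return every (large, small) category whose keywords occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if tokens & keywords
    ]


example = "After the typhoon, many trees were blown down and many cars were smashed."
labels = tag_post(example)
print(labels)
```

Matching on whole tokens rather than substrings avoids false hits such as "car" inside "scarce". A post that matches no keyword set gets an empty list and can be left for manual review.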

3. Sample description

Data fields for "Sina-Weibo" include ID, keyword, province, city, content, picture, location, release time, platform, number of forwards, comments, and number of likes, as shown in Table 4. Each column has a limit of no more than 140 characters. The topics of the dataset include property loss, traffic impact, casualties, power supply, communication impact, rescue arrangements, response measures, and public attitudes toward the typhoon, among others.

Table 4 Data from "Sina-Weibo"

ID	210
Keyword	Typhoon
Province	Guangdong Province
City	Zhuhai City
Content	After the typhoon, Mr. Liu asked me out for a walk to experience the post-disaster Zhuhai. Almost no restaurant was open. Having looked for a long time, finally we found a restaurant which was open. We saw so many cars smashed, trees blown down, and yachts blown ashore. My little white car was scratched by the branches. How can I go to work tomorrow, since Hengqin is so far away? The last picture, as a tribute to our soldiers!
Picture	http://ww2.sinaimg.cn/square/005WuHsBgy1fiu0v3h5b8j30qo0zkdvg.jpg ; http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0ul1m9aj30qo0zktks.jpg ; http://ww3.sinaimg.cn/square/005WuHsBgy1fiu0wqzd5nj30qo0zk4ap.jpg ; http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0y2re2bj30qo0z
Location	Zhuhai
Release time	2017-08-23 22:25
Platform	iPhone 7
Number of forwards	–
Comments	–
Number of likes	1
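For readers who load the "Sina-Weibo" records programmatically, the fields in Table 4 map onto a simple record type. The sketch below is an assumption-laden illustration: the column names in the actual .xls and .sql files may differ from the labels used in Table 4, the class `WeiboRecord` and the helpers are hypothetical, pictures are assumed to be separated by semicolons, and a dash in a count field is assumed to mean that no value was recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class WeiboRecord:
    id: int
    keyword: str
    province: str
    city: str
    content: str
    pictures: List[str]
    location: str
    release_time: datetime
    platform: str
    forwards: Optional[int]
    comments: Optional[int]
    likes: Optional[int]


def parse_count(value: str) -> Optional[int]:
    """Return an integer count, or None when the field holds a dash or is empty."""
    value = value.strip()
    return int(value) if value.isdigit() else None


def parse_row(row: Dict[str, str]) -> WeiboRecord:
    """Build a record from a mapping keyed by the Table 4 field labels."""
    return WeiboRecord(
        id=int(row["ID"]),
        keyword=row["Keyword"],
        province=row["Province"],
        city=row["City"],
        content=row["Content"],
        pictures=[u.strip() for u in row["Picture"].split(";") if u.strip()],
        location=row["Location"],
        release_time=datetime.strptime(row["Release time"], "%Y-%m-%d %H:%M"),
        platform=row["Platform"],
        forwards=parse_count(row["Number of forwards"]),
        comments=parse_count(row["Comments"]),
        likes=parse_count(row["Number of likes"]),
    )


# Values taken from Table 4 (content excerpted, first two picture URLs only).
row = {
    "ID": "210",
    "Keyword": "Typhoon",
    "Province": "Guangdong Province",
    "City": "Zhuhai City",
    "Content": "We saw so many cars smashed, trees blown down, and yachts blown ashore.",
    "Picture": "http://ww2.sinaimg.cn/square/005WuHsBgy1fiu0v3h5b8j30qo0zkdvg.jpg ; "
               "http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0ul1m9aj30qo0zktks.jpg",
    "Location": "Zhuhai",
    "Release time": "2017-08-23 22:25",
    "Platform": "iPhone 7",
    "Number of forwards": "–",
    "Comments": "–",
    "Number of likes": "1",
}
record = parse_row(row)
print(record.release_time, record.likes, len(record.pictures))
```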

Data fields for "Baidu" news include ID, title, link, source, release time, and keyword, as shown in Table 5. The fields for "WeChat" Subscription include ID, title, content, source, release time, and keyword, as shown in Table 6. The themes of the data include typhoon tracks, disaster loss statistics, government announcements, emergency measures, etc.

Table 5 Data from "Baidu" news

ID	51
Title	95 thousand people in Fujian to be relocated under Typhoon "Nesat" and "Haitang"
Link	http://www.huaxia.com/xw/dlxw/2017/07/5415198.html
Source	huaxia.com
Release time	2017-07-31, 15:11
Keyword	Typhoon Haitang 2017

Table 6 Data from "WeChat" Subscription

ID	31
Title	Typhoon "Haitang" has come! Weihai has become a sea!
Content	Typhoon "Haitang" has come! Weihai has become a sea!
Source	Neurologist [神经科专家] (name of a WeChat account)
Release time	2017-08-04
Keyword	Typhoon Haitang 2017
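The release times shown in Tables 4 to 6 use three slightly different layouts ("2017-08-23 22:25", "2017-07-31, 15:11" and "2017-08-04"), so combining the platforms on a common timeline needs a small normalization step. The sketch below assumes that the stored values follow the layouts printed in the tables, which the source does not guarantee; `parse_release_time` is a hypothetical helper name.

```python
from datetime import datetime

# Layouts as they appear in Tables 4, 5 and 6 respectively (assumed to match
# the stored values).
RELEASE_TIME_FORMATS = (
    "%Y-%m-%d %H:%M",    # "Sina-Weibo"
    "%Y-%m-%d, %H:%M",   # "Baidu" news
    "%Y-%m-%d",          # "WeChat" Subscription (date only)
)


def parse_release_time(raw: str) -> datetime:
    """Try each known layout in turn; date-only values default to 00:00."""
    text = raw.strip()
    for fmt in RELEASE_TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized release time: {raw!r}")


times = [
    parse_release_time("2017-08-23 22:25"),
    parse_release_time("2017-07-31, 15:11"),
    parse_release_time("2017-08-04"),
]
print(sorted(times))
```

Raising an error on an unknown layout, rather than silently skipping the record, makes it easier to spot formats that this sketch does not anticipate.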

4. Quality control and assessment

Keywords related to each typhoon event were diversified and optimized to maximize retrieval of related information from each social media platform. After data collection was completed, we manually checked the validity of the data and removed incomplete entries as well as entries irrelevant to the typhoon disaster. In addition, we established a database index system to avoid duplicate data. For disaster classification, classification standards were set up in advance to minimize possible discrepancies, and three colleagues then classified the original data to ensure the accuracy of the final classification results. Finally, we randomly sampled 500 data entries from each platform and found an accuracy rate of nearly 100%.
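One common way to let a database index prevent duplicates is a unique index on a field that identifies a record. The sketch below illustrates the idea with Python's built-in `sqlite3` module as a stand-in for MySQL. The source does not describe its index design, so the table name `news`, its columns, and the choice of the link as the unique key are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE news (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        link    TEXT NOT NULL,
        keyword TEXT
    )
    """
)
# Assumed design: one row per article URL.
conn.execute("CREATE UNIQUE INDEX idx_news_link ON news (link)")

# The same article (values from Table 5) harvested twice.
article = (
    '95 thousand people in Fujian to be relocated under Typhoon "Nesat" and "Haitang"',
    "http://www.huaxia.com/xw/dlxw/2017/07/5415198.html",
    "Typhoon Haitang 2017",
)
rows = [article, article]

# INSERT OR IGNORE skips rows that would violate the unique index.
conn.executemany(
    "INSERT OR IGNORE INTO news (title, link, keyword) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(stored)
```

In MySQL the equivalent statement is `INSERT IGNORE`; either way the database, not the harvesting code, enforces uniqueness.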

5. Value and significance

To our knowledge, no social media-based dataset existed for these typhoons before, and our dataset fills this gap. The data can be analyzed to meet different needs of disaster research. For example, the disaster loss data presented here can be re-classified into different categories to support real-time evaluation of disaster losses. The data can also be used for further analysis of typhoon disasters, such as sentiment analysis of victims in the typhoon area and the extraction of buzzwords during typhoon transits. In follow-up studies, we have used the texts in this dataset as a training corpus for the automatic identification of typhoon disaster information, with satisfactory results.

Acknowledgments

This work is supported by the National Key R&D Program of China (2016YFE0122600). We thank Edward T.-H. Chu, Associate Professor at National Yunlin University of Science and Technology, Taiwan, China for his advice on data collection. We thank Li Zhenyu from Shandong University of Science and Technology and Dr. Tian Chuanzhao from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences for their careful examination of our dataset.

References

1. Sakaki T, Okazaki M & Matsuo Y. Twitter analysis for real-time event detection and earthquake reporting system development. IEEE Transactions on Knowledge & Data Engineering 25 (2013): 919–931.

2. Bird D, Ling M & Haynes K. Flooding Facebook – the use of social media during the Queensland and Victorian floods. Australian Journal of Emergency Management 27 (2012): 27–33.

3. Wang YD, Li H, Wang T et al. Emergency information mining and analysis of emergency based on social media. Journal of Wuhan University 41 (2016): 290–297.

4. National Research Council (U.S.). Public Response to Alerts and Warnings Using Social Media: Report of a Workshop on Current Knowledge and Research Gaps. Washington, DC: The National Academies Press, 2013.

5. American Red Cross. Social media in disasters and emergencies. Available at: <http://i.dell.com/sites/content/shared-content/campaigns/en/Documents/red-cross-survey-social-media-in-disasters-aug-2010.pdf> [Accessed December 11, 2017].

6. Chae J, Thom D, Yun J et al. Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51–60.

7. Zhou Y, Yang L, Walle BVD et al. Classification of microblogs for support[ing] emergency responses: Case Study [of] Yushu Earthquake in China, 2014. Proceedings of the 47th Hawaii International Conference on System Sciences, 2013: 1553–1562.

8. Qu Y, Huang C, Zhang P et al. Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake. Proceedings of ACM Conference on Computer Supported Cooperative Work, 2011: 25–34.

9. Chae J, Thom D, Jang Y et al. Special section on visual analytics: Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51–60.

10. Chen Z, Gao T, Luo NX et al. Social media effectiveness to reflect the spatial and temporal distribution of natural disasters. Science of Surveying and Mapping 42 (2017): 44–48.

11. Liu HB & Zhai GF. A comparative study of the social response characteristics of different disasters based on social media information. Journal of Catastrophology 32 (2017): 187–193.

12. Stoové MA & Pedrana AE. Making the most of a brave new world: Opportunities and considerations for using Twitter as a public health monitoring tool. Preventive Medicine 63 (2014): 109–111.

13. Velardi P, Stilo G, Tozzi AE et al. Twitter mining for fine-grained syndromic surveillance. Artificial Intelligence in Medicine 61 (2014): 153–163.

Data citation

1. Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. Science Data Bank. DOI: 10.11922/sciencedb.547

Manuscript and author information

How to cite this article

Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. China Scientific Data 3 (2018), DOI: 10.11922/scdata.2017.0014.en

Yang Tengfei

social media data collection and analysis, writing.

PhD student, research area: natural language processing, disaster information mining.

Xie Jibo

motivation of the research, writing.

xiejb@radi.ac.cn

PhD, Associate Professor, research area: geospatial data infrastructure, remote sensing, geo-computation.

Li Guoqing

advice on dataset design and data check, writing.

PhD, Professor, research area: geospatial data infrastructure, remote sensing, big data.

National Key R&D Program of China (2016YFE0122600)

Publication history

Released in Zone I: January 29, 2018 (Version EN3)

Published in Zone II: May 14, 2018 (Version EN4)

Last updated: May 14, 2018 (Version EN5)