Crawling JavaScript websites using WebKit - with application to analysis of hate speech in online discussions

Author(s)

Publication date

2013-11

Series/Report no

NIK: Norsk Informatikkonferanse;2013

Publisher

Tapir Akademisk Forlag

Document type

Abstract

JavaScript client-side hidden web pages (CSHW) contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when JavaScript CSHW content from an external website is seamlessly included as part of the web pages. We have developed a prototype web crawler that efficiently extracts content from CSHW. The crawler uses WebKit to render web pages and to emulate human interaction with the page in order to reveal dynamic content. The WebKit crawler was used to collect text from 39 Norwegian online newspaper debate articles, where the online user discussions were included as JavaScript CSHW from other websites. The average speeds for extracting the main content and the JavaScript-generated discussions were 36.3 kB/sec and 8.8 kB/sec, respectively. An opinion-mining analysis of the collected text shows that the debate articles are more positive toward Islam and Muslims than the discussions that follow them. The results demonstrate the importance of being able to collect such JavaScript CSHW discussion content to get an overview of existing hate speech on the Internet.

Keywords

Permanent URL (for citation purposes)

  • http://hdl.handle.net/10642/1834