Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
Forums (or discussion boards) are large collections of information organized into boards, threads and posts. The basic information entity of a forum is the post, which carries the author, the date and time of posting, the actual content and so on. This information is valuable for a number of applications, such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is an extremely challenging task. Several customized parsers exist for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB), but these parsers depend on both the forum platform and the template used, which makes them impractical in real-world settings. Hence, in this paper we propose a semi-automatic, rule-based solution for extracting forum post information and inserting the extracted information into a database for analysis. The key challenge with this solution is identifying the extraction rules, which are normally specific to a forum platform and forum template. We therefore analyzed 100 forums to derive these rules and to test the performance of the algorithm. The results indicate that we were able to extract all the required information from the SMF and phpBB forum platforms, which represent the majority of forums on the web.
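The abstract does not give the rules or the algorithm itself, so the following is only a minimal sketch of what rule-based post extraction followed by database insertion might look like, not the authors' implementation. It assumes a rule is simply a mapping from a post field to an HTML class name; the class names (`post`, `username`, `postdate`, `content`), the sample markup, the table layout and the helper names `PostExtractor`, `extract_posts` and `store_posts` are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical rule set for one forum template: which class marks a post
# container and which classes mark the fields inside it.
RULES = {
    "post": "post",
    "fields": {"author": "username", "date": "postdate", "content": "content"},
}

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collects one dict per post, driven entirely by the rule set."""

    def __init__(self, rules):
        super().__init__()
        self.post_class = rules["post"]
        self.field_by_class = {cls: field for field, cls in rules["fields"].items()}
        self.posts = []
        # One marker per open element: "post", a field name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        marker = None
        if self.post_class in classes:
            marker = "post"
            self.posts.append({field: "" for field in self.field_by_class.values()})
        elif "post" in self._stack:
            marker = next(
                (self.field_by_class[c] for c in classes if c in self.field_by_class),
                None,
            )
        self._stack.append(marker)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if "post" not in self._stack:
            return
        # Attribute the text to the innermost enclosing field, if any.
        for marker in reversed(self._stack):
            if marker == "post":
                break
            if marker is not None:
                self.posts[-1][marker] += data
                break


def extract_posts(html, rules):
    parser = PostExtractor(rules)
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in post.items()} for post in parser.posts]


def store_posts(conn, posts):
    # Hypothetical table layout, chosen only for this illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, posted_at TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(p["author"], p["date"], p["content"]) for p in posts],
    )
    conn.commit()


# Invented sample markup, not taken from any real forum.
SAMPLE_HTML = """
<div class="post">
  <span class="username">alice</span>
  <span class="postdate">2010-05-01 10:00</span>
  <div class="content">Hello <b>world</b></div>
</div>
<div class="post">
  <span class="username">bob</span>
  <span class="postdate">2010-05-01 10:05</span>
  <div class="content">Second post</div>
</div>
"""
```

Calling `extract_posts(SAMPLE_HTML, RULES)` and passing the result to `store_posts` with a `sqlite3.connect(":memory:")` connection exercises the whole path. Under this assumption, supporting another template means supplying a different rule set rather than writing a new parser, which is one plausible reading of "semi-automatic" here.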
Showing items related by title, author, creator and subject.
Increased in synthetic cannabinoids-related harms: Results from a longitudinal web-based content analysis. Lamy, F.; Daniulaityte, R.; Nahhas, R.; Barratt, Monica; Smith, A.; Sheth, A.; Martins, S.; Boyer, E.; Carlson, R. (2017) © 2017 Elsevier B.V. Background Synthetic Cannabinoid Receptor Agonists (SCRA), also known as “K2” or “Spice,” have drawn considerable attention due to their potential of abuse and harmful consequences. More ...
Coll, Sandhya Devi (2015) This thesis reports on an inquiry on enhancing students’ learning experiences outside school (LEOS) using digital technologies. The inquiry took the nature of an ethnographic case study which was conducted over a year. ...
MacCallum, Diana; Khan, Shahed (2012) In 2009, Curtin University made an in-principle commitment to a ‘re-life’ project for Building 201, which houses its School of Built Environment. The project, Build 201.1, was to represent a major overhaul of the building’s ...