Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology

Sarencheh, S.; Potdar, Vidyasagar; Yeganeh, E.; Firoozeh, N.

dc.contributor.author	Sarencheh, S.
dc.contributor.author	Potdar, Vidyasagar
dc.contributor.author	Yeganeh, E.
dc.contributor.author	Firoozeh, N.
dc.contributor.editor	David Taniar
dc.contributor.editor	Osvaldo Gervasi
dc.contributor.editor	Beniamino Murgante
dc.contributor.editor	Eric Pardede
dc.contributor.editor	Bernady O Apduhan
dc.date.accessioned	2017-01-30T11:23:35Z
dc.date.available	2017-01-30T11:23:35Z
dc.date.created	2011-03-22T20:01:30Z
dc.date.issued	2010
dc.identifier.citation	Sarencheh, Saeed and Potdar, Vidyasagar and Yeganeh, Elham and Firoozeh, Nazanin. 2010. Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology, in Taniar, D. and Gervasi, O. and Murgante, B. and Pardede, E. and Apduhan, B.O. (ed), Lecture Notes in Computer Science, Volume 6017: Computational science and its applications - ICCSA 2010, pp. 370-382. Germany: Springer.
dc.identifier.uri	http://hdl.handle.net/20.500.11937/11240
dc.description.abstract	Forums (or discussion boards) are large collections of information organized into boards, threads, and posts. The basic information unit of a forum is the post, which records the author, the date and time of posting, the content, and so on. This information is valuable for applications such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is extremely challenging. Several customized parsers exist for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB); however, these parsers depend on the forum platform and the template used, which makes them impractical in real-world settings. Hence, in this paper we propose a semi-automatic rule-based solution for extracting forum post information and inserting it into a database for analysis. The key challenge with this solution is identifying extraction rules, which are normally specific to the forum platform and template. We therefore analyzed 100 forums to derive these rules and test the performance of the algorithm. The results indicate that we were able to extract all the required information from SMF and phpBB forum platforms, which represent the majority of forums on the web.
dc.publisher	Springer
dc.subject	Information extraction
dc.subject	Forum
dc.title	Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
dc.type	Book Chapter
dcterms.source.startPage	370
dcterms.source.endPage	382
dcterms.source.title	Lecture notes in computer science, volume 6017: computational science and its applications - ICCSA 2010
dcterms.source.isbn	9783642121647
dcterms.source.place	Heidelberg
dcterms.source.chapter	46
curtin.department	Centre for Extended Enterprises and Business Intelligence
curtin.accessStatus	Fulltext not available

Files in this item

Name: 154802_13792_PUB-CBS-EEB-MC-51 ...
Size: 261.7Kb
Format: PDF

This item appears in the following Collection(s)

Curtin Research Publications
