Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm

Saloun P., Andrsic D., Cigankova B., Anagnostopoulos I.

dc.creator	Saloun P., Andrsic D., Cigankova B., Anagnostopoulos I.	en
dc.date.accessioned	2023-01-31T09:53:28Z
dc.date.available	2023-01-31T09:53:28Z
dc.date.issued	2020
dc.identifier	10.1109/SMAP49528.2020.9248454
dc.identifier.isbn	9781728159195
dc.identifier.uri	http://hdl.handle.net/11615/78741
dc.description.abstract	A common task in natural language processing is text classification, which is useful for e.g. spam filters, document sorting, scientific article classification or plagiarism detection. This can still be done best and most accurately by a human; on the other hand, we can often accept a certain error in the classification in exchange for speed. Here, a natural language processing mechanism transforms the natural-language text into a form understandable by a classifier such as K-Nearest Neighbour, Decision Trees, Artificial Neural Network or Support Vector Machines. We can also use this human element to help automated classification improve its accuracy by means of crowdsourcing. This work deals with the classification of text documents and its improvement through crowdsourcing. Its goal is to design and implement a text document classifier prototype based on document similarity and to design an evaluation and crowdsourcing-based classification improvement mechanism. For classification the N-grams algorithm has been chosen, which was implemented in Java. The interface for crowdsourcing was created using CMS WordPress. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which leads to an extension of the classifier test data set and thus to more successful classification. We have tested our approach on two data sets with promising preliminary results even across different languages. This led to a real-world implementation started at the beginning of 2019 in cooperation of two universities: VŠB-TUO and OSU. © 2020 IEEE.	en
dc.language.iso	en	en
dc.source	SMAP 2020 - 15th International Workshop on Semantic and Social Media Adaptation and Personalization	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097609319&doi=10.1109%2fSMAP49528.2020.9248454&partnerID=40&md5=a93fd78b1b54b9a67a3bb45191c6ebb5
dc.subject	Computational linguistics	en
dc.subject	Crowdsourcing	en
dc.subject	Decision trees	en
dc.subject	Information retrieval systems	en
dc.subject	Natural language processing systems	en
dc.subject	Nearest neighbor search	en
dc.subject	Neural networks	en
dc.subject	Semantics	en
dc.subject	Social networking (online)	en
dc.subject	Statistical tests	en
dc.subject	Support vector machines	en
dc.subject	Text processing	en
dc.subject	Automated classification	en
dc.subject	Classification accuracy	en
dc.subject	Design and implements	en
dc.subject	Improvement mechanism	en
dc.subject	K-nearest neighbours	en
dc.subject	Natural language processing	en
dc.subject	Real-world implementation	en
dc.subject	Text document classifications	en
dc.subject	Classification (of information)	en
dc.subject	Institute of Electrical and Electronics Engineers Inc.	en
dc.title	Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm	en
dc.type	conferenceItem	en
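
The abstract describes an N-grams classifier based on document similarity, implemented in Java and improved with crowd-verified labels, but the record gives no implementation details. The following is only a minimal sketch of that general idea under stated assumptions: character trigrams, cosine similarity and a nearest-profile decision are illustrative choices, and the class `NGramSketch` with its methods `profile`, `cosine`, `train` and `classify` is hypothetical rather than the authors' code. Feeding crowd-confirmed documents back through `train` is likewise an assumption about how the feedback loop might be wired.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: trigram length and cosine similarity are assumptions,
// not details taken from the paper.
final class NGramSketch {
    private static final int N = 3; // assumed n-gram length

    // One aggregated n-gram count profile per category label.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    // Count overlapping character n-grams of a lower-cased, whitespace-normalized text.
    static Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two n-gram count vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int v = e.getValue();
            na += (double) v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) v * other;
            }
        }
        for (int v : b.values()) {
            nb += (double) v * v;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Merge a labeled document (initial data or a crowd-confirmed one) into its category profile.
    void train(String label, String text) {
        Map<String, Integer> target = profiles.computeIfAbsent(label, k -> new HashMap<>());
        profile(text).forEach((gram, count) -> target.merge(gram, count, Integer::sum));
    }

    // Return the label whose profile is most similar to the text, or null if nothing is trained.
    String classify(String text) {
        Map<String, Integer> query = profile(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(query, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In such a setup, a document whose predicted label is confirmed or corrected by crowd workers could be passed to `train` with the agreed label, so that later calls to `classify` compare against a larger reference profile. Character n-grams are used here because they need no language-specific tokenizer, which fits the abstract's mention of results across different languages, but that reading is an inference and not a statement from the paper.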

This item appears in the following collection(s)

Publications in journals, conferences, book chapters, etc. [19705]
