Controlled generation of synthetic corpora for NLP evaluation
Abstract
Automatic processing is necessary to build a global and fair view of the opinions and sentiments expressed on the web through comments and reviews. Various Extracting Tools (ETs) exist to analyse comments and reviews automatically; however, checking the accuracy of such tools remains challenging. We propose a new approach for that purpose. The main idea is to use a data-to-text approach to generate a synthetic corpus that can be used to validate ETs. The data specify what has to be said about something and in which proportion (e.g., 45% of the reviews say the room is small). A set of reviews (the synthetic corpus) is then generated, and the correctness of an ET can be assessed by how faithfully its output reflects the original data.
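The idea can be illustrated with a minimal sketch. This is not the authors' implementation: the opinion labels, the sentence templates, the keyword-based `toy_extractor` standing in for an ET, and the maximum-absolute-error score are all assumptions made for illustration, and each review is simplified to express exactly one opinion.

```python
import random
from collections import Counter

# Hypothetical input data: the share of reviews that should express each opinion.
TARGET = {
    "room is small": 0.45,
    "staff is friendly": 0.30,
    "breakfast is good": 0.25,
}

# Hypothetical surface realisations for each opinion (a stand-in for data-to-text).
TEMPLATES = {
    "room is small": ["The room was small.", "Our room felt really small."],
    "staff is friendly": ["The staff were friendly.", "Very friendly staff."],
    "breakfast is good": ["Breakfast was good.", "We enjoyed the breakfast."],
}


def generate_corpus(target, templates, n_reviews, seed=0):
    """Generate (opinion, text) pairs whose opinion shares follow `target`."""
    rng = random.Random(seed)
    corpus = []
    for opinion, share in target.items():
        for _ in range(round(share * n_reviews)):
            corpus.append((opinion, rng.choice(templates[opinion])))
    rng.shuffle(corpus)
    return corpus


def toy_extractor(text):
    """Keyword-based stand-in for an ET; returns an opinion label or None."""
    lowered = text.lower()
    if "small" in lowered:
        return "room is small"
    if "friendly" in lowered:
        return "staff is friendly"
    if "breakfast" in lowered:
        return "breakfast is good"
    return None


def estimated_proportions(corpus, extractor, target):
    """Share of reviews in which the extractor finds each target opinion."""
    counts = Counter(extractor(text) for _, text in corpus)
    return {opinion: counts.get(opinion, 0) / len(corpus) for opinion in target}


def max_abs_error(target, estimated):
    """Largest gap between a specified share and the extractor's estimate."""
    return max(abs(target[k] - estimated[k]) for k in target)


if __name__ == "__main__":
    corpus = generate_corpus(TARGET, TEMPLATES, n_reviews=100)
    estimate = estimated_proportions(corpus, toy_extractor, TARGET)
    print(estimate, max_abs_error(TARGET, estimate))
```

In such a setup, a real ET would replace `toy_extractor`, and the gap between the specified proportions and the extracted ones gives one possible measure of how faithfully the tool reflects the original data.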
Origin | Files produced by the author(s) |
---|