AttributesValues
type
value
  • A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. In order to quantify such impact, errors were systematically inserted at varying rates in an initially clean IR collection. Our results showed that significant impacts are noticed starting at a 5% error rate. Furthermore, stemming has proven to make systems more robust to errors.
Subject
  • Computer-related introductions in 1993
  • Unicode
  • Information retrieval
  • Artificial intelligence applications
  • Computational linguistics
part of
is abstract of
is hasSource of
Faceted Search & Find service v1.13.91 as of Mar 24 2020


Alternative Linked Data Documents: Sponger | ODE     Content Formats:       RDF       ODATA       Microdata      About   
This material is Open Knowledge   W3C Semantic Web Technology [RDF Data]
OpenLink Virtuoso version 07.20.3229 as of Jul 10 2020, on Linux (x86_64-pc-linux-gnu), Single-Server Edition (94 GB total memory)
Data on this page belongs to its respective rights holders.
Virtuoso Faceted Browser Copyright © 2009-2024 OpenLink Software