4 views

1 Answer

Table extraction is the process of recognizing and separating a table from a larger document, possibly also recognizing its individual rows, columns or elements. It may be regarded as a special form of information extraction.

Table extraction from webpages can take advantage of the special HTML elements that exist for tables, e.g., the "table" tag, and programming libraries may implement table extraction from webpages. The Python pandas software library can extract tables from HTML webpages via its read_html function.
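
To illustrate how the "table" tag makes extraction from webpages straightforward, here is a minimal sketch that uses only Python's standard library. The class name TableExtractor, the helper extract_tables and the sample HTML are made up for this illustration. The sketch assumes flat, well-formed tables (no nested tables, no rowspan/colspan handling), so treat it as a starting point rather than a robust implementation.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return every table in the given HTML string as nested lists."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample page with a single table.
SAMPLE_HTML = """
<p>Some surrounding text.</p>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_HTML)
for row in tables[0]:
    print(row)
```

With pandas and an HTML parser backend such as lxml installed, pandas.read_html(io.StringIO(SAMPLE_HTML)) should do the same job in one call, returning a list of DataFrames with one entry per table found.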

Table extraction from PDFs or scanned images is more challenging, because there is usually no table-specific machine-readable markup. Systems that extract data from tables in scientific PDFs have been described.

Wikipedia presents some of its information in tables; for example, 3.5 million tables can be extracted from the English Wikipedia. Some of the tables have a specific format, e.g., the so-called infoboxes. Large-scale table extraction of Wikipedia infoboxes forms one of the sources for DBpedia.
