This webpage presents the experiments and results of a six-month student team project on the topic of "Table Annotation using Deep Learning" at the School of Business Informatics and Mathematics of the University of Mannheim, supervised by Keti Korini and Christian Bizer.
We have conducted a set of experiments around the use of transformer language models for table annotation, based on the WDC Schema.org Table Annotation Benchmark (SOTAB) [1]. There are two multi-class classification tasks: Column Type Annotation (CTA) and Columns Property Annotation (CPA). CTA involves predicting the column type, while CPA involves identifying semantic relationships between different column pairs in a table. All of our experiments and code are available in our GitHub repository.
This work is organized as follows. In Chapter 1, we introduce and motivate the task of table annotation. In Chapter 2, we provide an overview of the table annotation tasks. In Chapter 3, we present related work, including the Transformer model. In Chapter 4, we describe the models, preprocessing steps, serialization approaches, and hyperparameter tuning. In Chapter 5, we describe the experimental setup, results, and error analysis. Finally, in Chapter 6, we discuss the results and conclude the work.
Contents
- 1. Introduction
- 2. Theoretical Background
- 3. Related Work
- 4. Methodology
- 5. Experimental Approach and Evaluation
- 6. Discussion and Conclusion
- 7. References
1. Introduction
Table annotation is the task of annotating a table with terms/concepts from a knowledge graph, database schema, or vocabulary. This process is crucial for data management tasks including data quality control, data discovery, and schema matching [7], especially since the web contains millions of tables available on websites, public data portals, and encyclopedias such as Wikipedia. However, to utilize these tables, we need to understand their structure and schema, which can be challenging due to their heterogeneous or unknown headers and content. The objective of this project is to address these challenges by applying diverse deep learning methods to both the Column Type Annotation and Columns Property Annotation tasks.
2. Theoretical Background
This chapter provides an overview of the theoretical underpinnings, frameworks, and specific information needed to follow our work. For this reason, we start by introducing the main tasks we are trying to solve.
2.1 Column Type Annotation (CTA)
Column Type Annotation (CTA) is a critical task in data integration and knowledge discovery [4] that involves assigning labels to each column in a table based on the types of entities it contains, such as "hotel/name", "streetAddress", "addressLocality", "Country", and "currency" [1], as shown in Figure 1. While data types provide basic information about the kind of data stored in a column, entity types capture domain-specific semantics that convey more meaning. For instance, labeling a column with "hotel/name" rather than just "String" provides valuable information about the specific type of data stored in that column, which is crucial for downstream tasks such as data analysis and retrieval [16].
An example of CTA: An e-commerce website may have a table of products with columns such as "product name", "description", "price", "category", and "brand". By annotating the columns with entity types such as "product name", "brand", and "category", a machine learning model can better understand the semantic meaning of the data, and accurately categorize products into appropriate categories, recommend related products, and provide a more personalized shopping experience to the user.
CTA can be approached as either a multi-class or a multi-label classification problem. In multi-class, each column is annotated with only one label representing its type, while in multi-label, each column is annotated with multiple labels.
2.2 Columns Property Annotation (CPA)
CPA is a web data integration task that plays a crucial role in capturing the semantics of a table for better table understanding [7]. It aims to identify semantic relationships between different columns in a table and can infer relationships that might not be immediately obvious. As illustrated in Figure 1, CPA can help to infer that the "address" column refers to the hotel's location.
One common way to tackle CPA is to view it as a multi-class classification task, and it is also known as column relation annotation or relation extraction in various works.

3. Related Work
Chapter 3 provides an overview of relevant literature on table annotation tasks and a Transformer-based language model, which is currently the state-of-the-art approach for these tasks. We also describe the datasets used to evaluate our approaches.
3.1 Transformer
Transformers are a type of neural network architecture that has become widely used in natural language processing and other machine learning applications. They were first introduced in 2017 by Vaswani et al. [5] and have since become one of the most popular deep learning models [17]. Transformers are based on the concept of self-attention, which allows the model to focus on different parts of the input sequence at each step.
The key innovation of transformers is self-attention, a mechanism that allows the model to selectively focus on different parts of the input sequence. Self-attention works by computing a weighted sum of the input sequence at each position, where the weights are determined by a similarity function between the current position and all other positions. The output of the self-attention layer is a sequence of weighted sums, which can then be fed into subsequent layers of the network. The attention function is formalized as:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
where Q, K, and V are the query, key, and value matrices, respectively, and \(d_k\) is the dimensionality of the key vectors.
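As a rough illustration of the formula above, the following NumPy sketch computes scaled dot-product attention for a single head without masking (the matrix sizes are purely illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of the attention formula above (single head, no masking)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of values

# Example: 4 positions, 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```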
Transformer-based language models are well suited to these tasks due to their language understanding capabilities and an attention mechanism that captures contextualized column representations.
3.2 TURL
The paper [4] provides a cornerstone for table representation through the introduction of a framework for table understanding called TURL. TURL achieves table understanding by learning deep contextualized representations of relational tables utilizing unsupervised pre-training on around 570K relational tables from Wikipedia. The pre-training data is constructed based on the WikiTable corpus [18], a corpus containing around 1.65M tables extracted from Wikipedia pages.
The universal architecture of TURL facilitates its application in various downstream tasks, requiring only minimal task-specific modifications. For example, the TURL framework can be used to infer valuable information about tables such as column types and the relationships between columns (CTA and CPA). The framework, as shown in Figure 2, is composed of three main components: an embedding layer that first converts input tables into embeddings, N stacked structure-aware Transformer layers that attend over these embeddings to capture useful information, and a projection layer for the pre-training objectives. In addition to the Masked Language Model (MLM) objective used by the standard BERT language model [14], which learns representations for tokens in table metadata, TURL proposes a Masked Entity Recovery (MER) objective to learn entity cell representations. Lastly, TURL leverages a pre-trained TinyBERT model [19] and pre-trains the model for 80 epochs.
3.3 TUTA
TUTA, as introduced in [8], represents the first transformer-based model for the CTA and CPA tasks. Notably, TUTA demonstrates the importance of incorporating structure-aware mechanisms for table representation tasks. It captures the spatial and hierarchical information of tables through a tree-based attention and position mechanism. TUTA inspired us to incorporate similar preprocessing and structure-based approaches in our work.
3.4 DODUO
DODUO, as described in [7], is the current state-of-the-art approach for the CTA and CPA tasks. This approach utilizes a Transformer-based language model and takes the entire table as input to predict column types and relationships between columns. What sets DODUO apart from its predecessors is its ability to annotate table columns using only the information contained in the table itself, without the need for external knowledge or context.
As shown in Figure 3, DODUO utilizes a pre-trained Transformer-based language model, particularly a BERT language model [14], and adopts multi-task learning to "transfer" shared knowledge between the CTA and CPA tasks. DODUO introduces table-wise serialization to incorporate table context into the prediction. Specifically, for a table T of n columns and m rows, the serialization is: \[serialize(T)::= [CLS]\;v^1_1\;...\;v^1_m\;[CLS]\;v^2_1\;...\;[CLS]\;v^n_1\;...\;v^n_m\;[SEP]\] where a [CLS] token is prepended to each column and \(v^i_j\) denotes the value in row j of column i. The embedding of each [CLS] token represents the learned column representation of the corresponding column. An output layer on top of a single column embedding, i.e., the [CLS] token, is used for the CTA task, while the column embeddings of a column pair are used for the CPA task.
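To make the table-wise serialization concrete, the sketch below builds such a sequence as a plain string; the table values and token strings are illustrative, not DODUO's actual implementation:

```python
def serialize_table(columns, cls="[CLS]", sep="[SEP]"):
    """Table-wise serialization in the style of DODUO (sketch):
    each column is prefixed with a [CLS] token; the sequence ends with [SEP]."""
    tokens = []
    for column_values in columns:            # columns: list of lists of cell strings
        tokens.append(cls)
        tokens.extend(str(v) for v in column_values)
    tokens.append(sep)
    return " ".join(tokens)

# Hypothetical two-column table (hotel name, city)
table = [["Hotel Adlon", "Hotel Sacher"], ["Berlin", "Vienna"]]
print(serialize_table(table))
# [CLS] Hotel Adlon Hotel Sacher [CLS] Berlin Vienna [SEP]
```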
3.5 TaBERT
TaBERT [20] is a pretrained language model built on top of BERT that jointly learns representations for natural language (NL) sentences and (semi-)structured tables. The contributions of TaBERT include linearization of the structure of tables to be compatible with the Transformer-based BERT model, content snapshots that encode a subset of relevant table content to cope with large tables, and a vertical attention mechanism that shares information across table rows. Lastly, TaBERT is pretrained on a corpus of 26 million tables and English paragraphs to capture the association between tabular data and related NL text.
The overall architecture of TaBERT is shown in Figure 4. Firstly, a content snapshot of a table is created based on the input NL utterance. Row-wise encodings are then produced for utterance tokens and cells. Lastly, all row-wise encodings are aligned and processed by vertical self-attention layers to generate utterance and column representations. These representations are then used for downstream tasks.
3.6 Dataset
We evaluated our solutions primarily on the WDC Schema.org Table Annotation Benchmark (SOTAB) [1]. To ensure the generalizability of our approach, we also tested our best serialization approaches on the WikiTables dataset by TURL [4]. The SOTAB benchmark is a collection of tables extracted from the Schema.org Table Corpus, which is maintained by the Data and Web Science Research Group at the University of Mannheim. The corpus comprises over 4.2 million web relational tables covering 43 Schema.org classes. The dataset defines both CTA and CPA labels for tables from 17 Schema.org classes/domains. The authors obtained the CPA labels from Schema.org terms used as column headers in a table. The CTA label for a column is in turn derived from its CPA label using the Schema.org vocabulary definition. For the CTA task, the SOTAB dataset provides annotations for 162,351 columns from 59,548 tables, with a label space of 91 type labels. For the CPA task, it provides annotations for 174,998 column pairs from 48,379 tables, with a label space of 176 property labels. We used the provided train/validation/test splits for both tasks.
The WikiTables dataset [4] is a collection of tables extracted from the WikiTable corpus [18], which contains over 1.54 million tables extracted from Wikipedia pages. The dataset defines both CTA and CPA labels. To derive the labels for both tasks, the authors referred to Freebase [21] to obtain semantic types and relations, respectively. For each column, the CTA labels are the common types of its entities. Each object column in a table is paired with the subject column to identify the relations shared by more than half of the entity pairs in the two columns; these relations serve as the CPA labels. For the CTA task, the dataset provides annotations of 628,254 columns from 397,098 tables, with a label space of 255 type labels for training, and 13,025 (13,391) columns from 4,764 (4,844) tables for test (validation). For the CPA task, it provides annotations of 62,954 column pairs from 52,943 tables, with a label space of 121 relation labels for training, and 2,072 (2,175) column pairs from 1,467 (1,560) tables for test (validation). The statistics of the WikiTables dataset are shown in Table 1.
| | CTA mean | CTA median | CTA mode | CPA mean | CPA median | CPA mode |
|---|---|---|---|---|---|---|
| Train columns | 1.58 | 1 | 1 | 2.18 | 2 | 2 |
| Val columns | 2.76 | 3 | 3 | 2.39 | 2 | 2 |
| Test columns | 2.73 | 3 | 3 | 2.41 | 2 | 2 |
4. Methodology
Chapter 4 describes the approaches introduced in this work. We give an overview of data preprocessing and augmentations performed, followed by serialization methods to account for table content, and hyperparameter tuning.
4.1 Preprocessing and Augmentation
To accommodate the 512-token limit of BERT-like language models, we trim cell values of longer textual data in order to fit more cells from a table. Following the empirical studies of Wang et al. [8], we retain at most 8 tokens from a cell, since long textual strings often introduce noise and disrupt the structure of a table input. Additionally, we experiment with different threshold cell lengths, i.e., the median and mean cell length computed on a per-column basis. The rationale is that a long textual column contains strings of differing lengths.
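The following sketch illustrates this trimming step; it approximates tokens by whitespace splitting, whereas the actual experiments use the language model's subword tokenizer:

```python
import statistics

def trim_cells(column_values, strategy="fixed", max_tokens=8):
    """Trim long textual cells (sketch): keep a fixed number of tokens per cell,
    or a per-column median/mean threshold."""
    token_lists = [str(v).split() for v in column_values]
    if strategy == "median":
        limit = max(1, int(statistics.median(len(t) for t in token_lists)))
    elif strategy == "mean":
        limit = max(1, int(statistics.mean(len(t) for t in token_lists)))
    else:                       # "fixed": keep at most 8 tokens per cell, following [8]
        limit = max_tokens
    return [" ".join(tokens[:limit]) for tokens in token_lists]

print(trim_cells(["A very long hotel description that goes on and on", "Short text"]))
```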
To enhance our data processing, we perform data augmentation (DA) at both the cell and table level during the earlier stages of our workflow. Specifically, we apply standard textual DA techniques, such as word swapping, word replacement, word deletion, and column cell value shuffling, at the cell level. At the table level, we perform random cell deletion and column cell shuffling.
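As a rough illustration, the sketch below shows how a few of these cell- and table-level augmentations can be implemented; the helper functions are illustrative and not our exact pipeline:

```python
import random

def swap_words(cell, rng):
    """Cell-level augmentation: swap two random words within a cell."""
    words = cell.split()
    if len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def delete_word(cell, rng):
    """Cell-level augmentation: drop one random word from a cell."""
    words = cell.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)

def shuffle_column_cells(column, rng):
    """Table-level augmentation: shuffle the order of cells within a column."""
    column = list(column)
    rng.shuffle(column)
    return column

rng = random.Random(0)
print(swap_words("four star hotel in Berlin", rng))
print(shuffle_column_cells(["Berlin", "Vienna", "Prague"], rng))
```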
4.2 Serialization
Transformer-based language models expect token (text) sequences as inputs. This translates to representing tables as text sequences, formally known as table serialization, for the CTA and CPA tasks. Single-column serialization is an intuitive solution that concatenates column values into a sequence to feed into the language model. Specifically, as defined by Suhara et al. [7], for a column C with column values v1,...,vm, the serialized sequence is [7]: \[serialize_{single}(C)::= [CLS]\;v_1\;...\;v_m\;[SEP]\]
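A minimal sketch of this single-column serialization, with illustrative values, is:

```python
def serialize_single_column(values, cls="[CLS]", sep="[SEP]"):
    """Single-column serialization as defined above (sketch):
    the column values are concatenated between [CLS] and [SEP]."""
    return " ".join([cls, *[str(v) for v in values], sep])

print(serialize_single_column(["Hotel Adlon", "Hotel Sacher", "The Ritz"]))
# [CLS] Hotel Adlon Hotel Sacher The Ritz [SEP]
```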
Table serialization conveniently translates the CTA and CPA tasks into sequence classification and sequence-pair classification tasks, respectively. However, the major drawback of single-column serialization is that it treats each column in a table as an independent sequence, neglecting the crucial table context that defines a real-world entity in a relational table. Previous works [7,9,10,11] have highlighted the importance of table context for the CTA task. At the other end of the spectrum for table serialization is DODUO, which represents a table as an aggregation of contextual information from all column values.
In this work, we employ two distinct serialization techniques: Neighboring Column Serialization and TaBERT Serialization. In the following, we explain these techniques and provide an example of both serializations in Figure 7.
(i) Neighboring Column: The Neighboring Column Serialization technique, illustrated in Figure 5, utilizes adjacent columns as local context in a relational table. Previous research by SATO [11] demonstrated that incorporating local context aids in resolving semantic type ambiguities in single-column predictions. To accomplish this, their approach incorporates the predicted semantic types of immediately adjacent columns as local context for a neural column prediction model. For our approach, we utilize neighboring columns of varying window sizes along with the target column in the CTA task. Furthermore, for the CPA task, we incorporate neighboring columns for both the main and target columns.
To build on the Neighboring Column Serialization approach, we introduce a more complex variation called Neighboring Column with Summary Serialization. This approach leverages results from exploratory data analysis (EDA) to extract relevant column value statistics. We use the pandas-profiling library [12] to infer column data types and compute common statistical metrics such as the mean, maximum, and minimum text length. We prepend the statistical results to the data of each column, providing a syntactic summary for lengthy columns/tables as part of the local context.
However, for the CPA task, incorporating neighboring columns for both the main and target columns doubles the total number of columns included in the serialization compared to CTA. Therefore, we also transfer the Neighboring Column Serialization technique from the CTA task to the CPA task without including the main column, which we refer to as No-Main Neighboring Column Serialization.
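The sketch below shows the basic idea of Neighboring Column Serialization for a single target column; the table values are illustrative, and the MP Ratio token budgeting described in Section 4.4 is omitted here:

```python
def serialize_neighboring(columns, target_idx, window=5, cls="[CLS]", sep="[SEP]"):
    """Neighboring Column Serialization (sketch): the target column is placed first,
    followed by up to `window` columns on each side as local context."""
    target = columns[target_idx]
    lo, hi = max(0, target_idx - window), min(len(columns), target_idx + window + 1)
    neighbors = [c for i, c in enumerate(columns[lo:hi], start=lo) if i != target_idx]
    parts = [cls] + [str(v) for v in target] + [sep]
    for col in neighbors:
        parts += [str(v) for v in col] + [sep]
    return " ".join(parts)

# Hypothetical hotel table: name, city, price; target is the "city" column
table = [["Hotel Adlon", "Hotel Sacher"], ["Berlin", "Vienna"], ["250", "310"]]
print(serialize_neighboring(table, target_idx=1, window=1))
# [CLS] Berlin Vienna [SEP] Hotel Adlon Hotel Sacher [SEP] 250 310 [SEP]
```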

(ii) TaBERT: The TaBERT Serialization, illustrated in Figure 6, uses the row-based idea from Yin et al. [20]. To solve the CTA/CPA tasks, we implement the idea in a different way. Our goal is to capture the relationship between the target column and the context columns by inputting the context data row-wise. To achieve this, we follow the approach proposed by TUTA [8] and prepend a data type token to each cell. This row serialization approach helps the Transformer's attention mechanism better capture the table structure while maintaining constant intervals for each column. Moreover, by including data types, the language model can learn the relationships between the cells of the target column and the prediction label. For instance, URL data for company, label, or hotel photos may indicate a target label of "Logo".
To reduce the number of additional tokens required for inserting data types in each cell, we introduce Column-TaBERT Serialization, which infers the data type for each column by majority count and includes the data types for each column only once. For the CTA task, the resulting serialization is illustrated in Figure 6 or: \[serialize_{col\_tabert}(C)::= [CLS]\;target\_col\_type\;target\_col\_data\;[SEP]\;context\_col\_1\_type\;context\_col\_2\_type\;...\;[SEP]\;row\_data\;[SEP]\]
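The following sketch illustrates the Column-TaBERT serialization given above; the simple majority-count type inference and the example table are illustrative assumptions, not our exact implementation:

```python
def infer_column_type(values):
    """Infer a coarse data type per column by majority count (sketch)."""
    def cell_type(v):
        s = str(v)
        if s.startswith("http"):
            return "url"
        if s.replace(".", "", 1).isdigit():
            return "number"
        return "text"
    types = [cell_type(v) for v in values]
    return max(set(types), key=types.count)

def serialize_col_tabert(columns, target_idx, cls="[CLS]", sep="[SEP]"):
    """Column-TaBERT serialization (sketch): target column type and data first,
    then each context column type once, then the context data row by row."""
    types = [infer_column_type(c) for c in columns]
    target = columns[target_idx]
    context = [c for i, c in enumerate(columns) if i != target_idx]
    parts = [cls, types[target_idx], *map(str, target), sep]
    parts += [t for i, t in enumerate(types) if i != target_idx] + [sep]
    for row in zip(*context):                  # context data serialized row-wise
        parts += [str(v) for v in row]
    parts.append(sep)
    return " ".join(parts)

table = [["Hotel Adlon", "Hotel Sacher"],
         ["http://a.io/logo.png", "http://b.io/logo.png"],
         ["250", "310"]]
print(serialize_col_tabert(table, target_idx=0))
```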


4.3 Models
Our approaches are built upon Transformer-based language models. We mainly utilize BERT and RoBERTa models and experiment with the Longformer model for the CPA task.
BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model based on Transformers, which was proposed by Google [14]. It uses bidirectional encoders to learn context-dependent word representations and has achieved outstanding performance in various NLP tasks, such as sentiment analysis, question answering, and natural language inference.
RoBERTa (Robustly Optimized BERT approach) is an improved version of the BERT model that was proposed by Facebook AI [15]. RoBERTa is pre-trained on a massive amount of text data with longer training times and larger batches than BERT, which improves its performance in downstream NLP tasks.
Longformer: Transformer-based models suffer from a limitation when it comes to processing long sequences, which is a challenge in table annotation where long texts need to be processed. While models such as BERT or RoBERTa [14, 15] are limited to a sequence length of 512 tokens, [13] proposes Longformer, whose linearly scaling attention mechanism can process longer sequences. The maximum length is then constrained only by GPU memory.
We further explore different experimental models for the CPA task in various iterations.
(i) Subtable Model: According to the statistics provided by the SOTAB Benchmark [1], the CPA task involves 48,379 tables with a total of 174,998 annotated column pairs. Notably, the tables have a median of 42 rows and 8 columns. To increase the diversity of the training data and incorporate more parts of each table, we divide each table into subtables consisting of a maximum of 20 rows per subtable and up to 5 subtables per table. As a result, our approach encompasses a total of 391,115 column pairs.
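A minimal sketch of this subtable splitting, with the limits stated above, is:

```python
def split_into_subtables(rows, max_rows=20, max_subtables=5):
    """Split a table's rows into subtables (sketch): at most 20 rows per subtable
    and at most 5 subtables per table, as described above."""
    subtables = [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
    return subtables[:max_subtables]

rows = [[f"row {i} col A", f"row {i} col B"] for i in range(110)]
print([len(s) for s in split_into_subtables(rows)])  # [20, 20, 20, 20, 20]
```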
(ii) 2-Step Model: To improve the accuracy of our predictions, we adopt a 2-step model that leverages the 17 schema.org classes associated with the tables. Specifically, we first predict the schema type of a table using a model trained on the SOTAB Benchmark [1] training data. This allows us to restrict our predictions to only the relevant labels for each column. For example, if a table is identified as "Book", its columns will not be predicted as "MusicRecording/Name" in a CTA task. Next, we train submodels for each schema type to predict the final label for each column. To make predictions on test data, we first predict the schema type and then use the corresponding submodel to predict the final label for each column.
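The inference flow of the 2-step model can be sketched as follows; `schema_model` and `submodels` are hypothetical objects assumed to expose a simple `predict` method:

```python
def two_step_predict(table, schema_model, submodels):
    """2-Step Model inference (sketch): first predict the schema.org type of the table,
    then delegate column annotation to the submodel trained for that schema."""
    schema = schema_model.predict(table)     # e.g. "Book", "Hotel", ...
    column_model = submodels[schema]         # submodel restricted to that schema's labels
    return schema, column_model.predict(table)

# Usage (hypothetical): schema, column_labels = two_step_predict(test_table, schema_model, submodels)
```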
4.4 Hyperparameter Tuning
To investigate the impact of different window sizes for Neighboring Column Serializations, we conduct hyperparameter tuning for both the CTA and CPA tasks. We also experiment with different preprocessing methods, such as TUTA, median, and mean preprocessing, for each promising serialization. Additionally, we vary the proportions of the target column and local context (MP - Main Percentage Ratio) to assess the effect of context weight on different serializations, experimenting with MP Ratios of 0.2, 0.5, and 0.8. For instance, an MP Ratio of 0.8 for Neighboring Column Serialization in the CPA task allocates 80% of the available token limit to the main and target columns and only 20% to their neighbors. Finally, we investigate the performance of different augmentation strategies.
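As a simple illustration of how the MP Ratio splits the token budget (the helper name and exact rounding are illustrative):

```python
def token_budget(max_tokens=512, mp_ratio=0.5):
    """Split the token budget between the main/target columns and their neighbors (sketch).
    E.g. an MP Ratio of 0.8 gives 80% of the 512 tokens to the main and target columns."""
    main_budget = int(max_tokens * mp_ratio)
    context_budget = max_tokens - main_budget
    return main_budget, context_budget

print(token_budget(mp_ratio=0.8))  # (409, 103)
```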
5. Experimental Approach and Evaluation
This chapter describes our experimental setup, results, and evaluation.
5.1 Setup
For our experiments on both the SOTAB benchmark and the WikiTables dataset, we trained a RoBERTa model with default settings: 30 epochs, 3 runs (with seeds 0-2), and a batch size of 32, unless otherwise specified. The initial learning rate was set to 5e-5, and we used a linear decay scheduler with no warmup. We set a maximum token length of 512, used a window size of 5, and a standard MP Ratio of 0.5 for the relevant serializations. We use single-column serialization as the baseline configuration. For the SOTAB benchmark, we formulated the tasks as multi-class prediction problems and used Cross Entropy loss. For the WikiTables dataset, we formulated the tasks as multi-label prediction problems and used Binary Cross Entropy loss. We evaluated the performance of our models using micro F1 scores for both datasets.
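A training configuration mirroring these settings could look like the sketch below; it assumes the Hugging Face Transformers library and omits dataset loading and tokenization, so it is an illustration of the stated hyperparameters rather than our exact training script:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_LABELS = 91  # e.g. 91 CTA type labels for SOTAB

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=30,               # default setting used in our experiments
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    lr_scheduler_type="linear",        # linear decay
    warmup_steps=0,                    # no warmup
    seed=0,                            # runs repeated with seeds 0-2
)

# train_dataset / eval_dataset: serialized columns tokenized with max_length=512 (omitted)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```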
To conduct our experiments, we utilized computing resources from the University of Mannheim (dws-server) and bwHPC servers provided by the state of Baden-Württemberg.
5.2 Experiments for SOTAB benchmark
For the SOTAB benchmark, we conducted extensive experiments on serialization for both CTA and CPA tasks. Specifically, we tested different variants of Neighboring Column Serialization and TaBERT Serialization. For CPA, we also performed No-Main Neighboring Column Serialization, which treats a CPA task as a CTA task.
In order to explore more possibilities for the challenging CPA task, we further investigated Subtable, 2-Step, and Longformer models. For the Subtable model, we trained a BERT model for 30 epochs with a batch size of 16 and a maximum token length of 512 tokens. We utilized a Neighboring Column Serialization of window size 2 for this model. For the 2-Step model, we used default parameter settings, including a RoBERTa model trained for 30 epochs. For the Longformer model, we used single-column serialization with a batch size of 8 and maximum length of 1000.
5.3 Experiments for WikiTables dataset
To test the generalizability of our findings from the SOTAB benchmark, we transferred the serialization methods to the WikiTables dataset. For the CTA task, we used only Column-TaBERT Serialization due to computational limitations resulting from the dataset's size. For the CPA task, we experimented with different serializations, including TaBERT, Column-TaBERT, and No-Main Neighboring Column Serializations. We utilized the default experimental parameters as described above and, following Suhara et al. [7], trained the models for 15 epochs.
5.4 Main Results
SOTAB benchmark
Table 2 reports the results for the SOTAB benchmark. All serialization methods except DODUO and TURL utilize the RoBERTa model. Across both tasks, for all but one of the implemented serializations, we achieved a significant improvement in micro F1 over the single-column baseline serialization. For the CTA task, we achieve an increase of up to 7.62 F1 points over the state-of-the-art DODUO model, using Neighboring Column Serialization with a window size of 5 and an MP Ratio of 0.2. Similar results are achieved by TaBERT and Neighboring Column Serialization, indicating the importance of the context ratio and a certain structure in the input sequence. Column-TaBERT scores roughly 3 F1 points lower than its TaBERT counterpart, and likewise Neighboring Column with Summary scores lower than plain Neighboring Column, suggesting that the Transformer's attention mechanism cannot identify the significant positions in a less structured input sequence.
For the CPA task, different variants of Neighboring Column Serialization achieve F1 scores about 6 points higher than the existing methods. Once again, Neighboring Column Serialization with a window size of 5 and an MP Ratio of 0.2 achieves the highest F1 score. TaBERT variants score about 2.5 F1 points lower, indicating that row serialization is less suitable for column-pair prediction. Once again, less structured serialization solutions achieve lower F1 scores (Column-TaBERT and Neighboring Column with Summary). Lastly, only Neighboring Column with Summary with an MP Ratio of 0.8 achieves a lower F1 score than Single-Column serialization. Overall, our results highlight the importance of carefully selecting the appropriate serialization method to achieve optimal performance in table annotation tasks.
Longformer's performance is lower than BERT's; one possible reason is that the Longformer model employs a different attention mechanism, which may not converge well with the current set of parameters.
For the additional experiments, we did not achieve an improvement in test F1 score. Notably, for the Longformer model, further modifications to the model setup should be investigated. For the 2-Step model, we achieve a test F1 score of 96.73 for the schema type prediction model. Building on top of this model, the final ensemble of submodels by schema class achieved a test F1 score of 84.34. Upon further investigation, decoupling the first model and feeding test data with correct schema labels into the respective submodels merely increased the final test F1 score to 84.81.
| CTA | MP Ratio 0.2 | MP Ratio 0.5 | MP Ratio 0.8 |
|---|---|---|---|
| Single-Column* | - | 82.97 | - |
| DODUO*,** | - | 84.82 | - |
| TURL*,** | - | 78.96 | - |
| TaBERT | 92.14 | 92.24 | 91.60 |
| Col-TaBERT | 87.92 | 88.40 | 88.67 |
| Neighbor, WS=5 | 92.44 | 92.14 | 91.29 |
| Sum_Neighbor, WS=5 | 90.97 | 88.63 | 85.82 |

*: MP Ratio is irrelevant; **: Results from [1]
| CPA | MP Ratio 0.2 | MP Ratio 0.5 | MP Ratio 0.8 |
|---|---|---|---|
| Single-Column* | - | 79.52 | - |
| DODUO*,** | - | 79.96 | - |
| TURL*,** | - | 72.93 | - |
| TaBERT | 83.09 | 83.68 | 82.49 |
| Col-TaBERT | 82.79 | 83.22 | 82.38 |
| Neighbor, WS=5 | 86.09 | 85.50 | 85.14 |
| Sum_Neighbor, WS=5 | 83.50 | 82.07 | 77.82 |
| No-Main Neighbor, WS=5 | 85.88 | 85.29 | 84.87 |

*: MP Ratio is irrelevant; **: Results from [1]
| | Test Micro F1 |
|---|---|
| Subtable Model* | 83.20 |
| Two-Step Model* | 84.34 |
| Longformer* | 51.74 |
WikiTables Dataset
Table 3 reports the results for the WikiTables dataset. As for the SOTAB benchmark, all serialization methods utilize the RoBERTa model except for DODUO and TURL, for which we report the results measured in [1]. In the CTA task, our Column-TaBERT Serialization achieved an F1 score comparable to the state-of-the-art DODUO, missing it by only 1 point, and outperformed the TURL baseline by 2.5 points. However, our results on the SOTAB benchmark suggest that there may be room for improvement using more performant serialization methods. For the CPA task, our TaBERT Serialization outperformed existing methods by 1.5 F1 points. These results demonstrate that our serialization approaches can achieve comparable performance on different real-world datasets. Overall, our findings suggest that careful selection of serialization methods can significantly improve the performance of table annotation tasks.
| CTA | Test Micro F1 |
|---|---|
| Col-TaBERT | 91.42 |
| DODUO | 92.45 |
| TURL | 88.86 |
| CPA | Test Micro F1 |
|---|---|
| Col-TaBERT | 92.85 |
| DODUO | 91.72 |
| TURL | 90.94 |
| TaBERT | 93.30 |
| No-Main Neighbor | 91.60 |
5.5 Results of Different Hyperparameters
This section describes the hyperparameter tuning results for selected serializations. We experimented with preprocessing that fixes the cell length in a table, different data augmentation techniques, varying window sizes for Neighboring Column Serialization, and different MP Ratios.
Preprocess and Augmentation
Through our experiments, as shown in Table 4a, we found that preprocessing provides a slight improvement in the performance of different serialization methods. This highlights the importance of providing structured input sequences to Transformer models and using more columns whenever possible. Furthermore, our initial experiments with data augmentation using single-column serialization indicate that, for large datasets like the SOTAB benchmark, the need for data augmentation may be limited. These results are reported in Table 4c.
Window Size
For the SOTAB benchmark, we identified that suitable window sizes for Neighboring Column Serialization lie between 4 and 6, as shown in Table 4b. This indicates that a complete table structure with relevant columns is a suitable input for language models in table annotation tasks. Notably, we observed that the two main differences between Neighboring Column Serialization and DODUO are the structure of the input sequence and the pretrained language model used. Specifically, our approach places the target column at the front of the input sequence to focus on it, while DODUO takes the entire table as input. As for the model, we used RoBERTa instead of the BERT model used by DODUO.
MP Ratio
The MP (Main Percentage) Ratio had a significant impact for certain serializations, such as Neighboring Column with Summary Serialization. However, the results in Table 2 confirm that different serializations require different MP Ratios due to the nature of their serialization strategies.

| Preprocessing Method | CTA with TaBERT (MP Ratio=0.5) | CPA with No-Main Neighbor (MP Ratio=0.5) |
|---|---|---|
| TUTA | 92.45 | 85.77 |
| MEAN | 92.05 | 85.70 |
| MEDIAN | 92.24 | 85.68 |
| Window Size | CTA | CPA |
|---|---|---|
| 1 | 89.59 | 83.31 |
| 2 | 91.20 | 84.71 |
| 3 | 92.09 | 85.31 |
| 4 | 92.21 | 85.86 |
| 5 | 92.14 | 85.50 |
| 6 | 91.80 | 85.79 |
| 7 | 91.78 | 85.81 |
| Augmentation method | Test Micro F1 |
|---|---|
| Baseline | 77.95 |
| Delete Random Cell | 77.28 |
| Replace with Frequent Value | 78.00 |
| Shuffle Column Cell | 77.29 |
| Shuffle Column Cell Value | 77.05 |
| Swap Word | 74.13 |
| Replace Word | 75.48 |
5.6 Error Analysis
To evaluate the performance, we conducted an error analysis on Neighboring Column Serialization with a window size of 2. We chose this approach because it represents a promising serialization method. The model analyzed is a BERT model trained for 30 epochs with a batch size of 16. We initially conducted a statistical analysis of three different aspects: (i) schema type, (ii) data type, and (iii) specific challenges. Subsequently, we looked further into the data and manually classified the errors.
(i) Schema Type: The tables are distributed across 17 schema.org types and do not include any metadata, which means no table headers or captions are available. We list the five schema types with the highest and the five with the lowest overall test scores for CTA in Table 5a and for CPA in Table 5b. The performance of CTA and CPA varies across different schema types. MusicRecording is the only schema where both tasks perform well, while CreativeWork and Product are weak for both tasks. Recipe and Person are among the top 5 schemas for CTA, but are among the lowest 5 for CPA.
We attribute this discrepancy to the fact that CPA employs more comprehensive and detailed classification labels. For instance, the Recipe schema in CPA includes specific labels for fatContent, sugarContent, and proteinContent, whereas in CTA they are all labeled as Mass. Additionally, cookingTime and prepTime in CPA are classified as Duration in CTA. In the Person schema, the CTA labels are more straightforward, such as Person/name, telephone, and Country, while CPA has more complicated labels like affiliation, memberOf, worksFor, and alumniOf, which can be ambiguous.
| Top 5: Label | Overall Test F1 | Lowest 5: Label | Overall Test F1 |
|---|---|---|---|
| Recipe | 96.23 | Product | 86.54 |
| Museum | 95.40 | Hotel | 86.05 |
| Event | 93.89 | JobPosting | 84.14 |
| Person | 93.07 | CreativeWork | 82.50 |
| MusicRecording | 92.59 | MusicAlbum | 74.07 |
| Top 5: Label | Overall Test F1 | Lowest 5: Label | Overall Test F1 |
|---|---|---|---|
| Restaurant | 96.25 | SportsEvent | 84.50 |
| TVEpisode | 96.08 | Recipe | 82.25 |
| MusicRecording | 94.64 | Person | 81.24 |
| Place | 93.18 | CreativeWork | 76.36 |
| LocalBusiness | 90.04 | Product | 70.34 |
(ii) Data Type: The columns are divided into 5 groups based on their data type: Boolean, Numeric, URL, Short Text, and Long Text. Boolean columns contain only "yes"/"no" or "true"/"false" values, Numeric columns contain only numbers, and URL columns contain web addresses. All other columns are classified as text; if a value contains fewer than five tokens, the column is considered Short Text, otherwise Long Text. The micro F1 score and distribution of each data type are summarized in Table 6a for CTA and Table 6b for CPA.
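A rough sketch of this grouping rule is given below; the exact string checks and the use of an average token count for the text threshold are simplifying assumptions:

```python
def classify_data_type(column_values):
    """Assign one of the five data-type groups used in the error analysis (sketch)."""
    values = [str(v).strip().lower() for v in column_values if str(v).strip()]
    if values and all(v in {"yes", "no", "true", "false"} for v in values):
        return "Boolean"
    if values and all(v.replace(".", "", 1).replace("-", "", 1).isdigit() for v in values):
        return "Numeric"
    if values and all(v.startswith(("http://", "https://", "www.")) for v in values):
        return "URL"
    # Text: fewer than five tokens per value -> Short Text, otherwise Long Text
    avg_tokens = sum(len(v.split()) for v in values) / max(len(values), 1)
    return "Short Text" if avg_tokens < 5 else "Long Text"

print(classify_data_type(["yes", "no", "yes"]))                    # Boolean
print(classify_data_type(["A cosy hotel near the main station"]))  # Long Text
```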
We observed that Boolean data accounts for a very small portion of the overall test sets. Therefore, any predictions made on this data can significantly impact the final score. For instance, in the CPA task, although there were only 3 incorrect predictions out of a total of 20 Boolean columns, the score decreased notably. Upon further investigation, we found these errors were due to wrong ground-truth labels annotating Boolean data as height or gtin8. When comparing the CPA and CTA tasks on the other data types, the CPA task generally has lower scores. This could be because the CPA task splits the label space further, which makes it more challenging to assign the correct labels. For instance, on Numeric data, the CPA task has a large number of incorrect predictions between gtin, gtin12, gtin13, and gtin14. Also, in URL data, some URLs refer to the source of images. In the CPA task, these URLs are further split into more specific categories such as logo, image, and photo, which have a high degree of similarity and make it more difficult for the model to distinguish between them. The same issue also applies to the Short Text and Long Text data types.
| Data Type | Overall Test F1 | Percentage of the type (%) |
|---|---|---|
| Boolean | 100 | 0.3 |
| Numeric | 91.77 | 4.7 |
| URL | 97.06 | 6.8 |
| Long Text | 86.92 | 18.8 |
| Short Text | 90.27 | 69.4 |
| Data Type | Overall Test F1 | Percentage of the type (%) |
|---|---|---|
| Boolean | 85 | 0.1 |
| Numeric | 75.06 | 7.8 |
| URL | 94.71 | 4.5 |
| Long Text | 88.17 | 12.5 |
| Short Text | 82.52 | 75.2 |
(iii) Specific Challenges: The specific challenges include Missing Values, Value Format Heterogeneity, Corner Cases, and Random Columns, all of which are defined in the WDC Schema.org Table Annotation Benchmark (SOTAB) [1]. The results are summarized in Table 7 for CTA and CPA.
| | CTA | CPA |
|---|---|---|
| Test (Missing Values) | 90.60 | 82.74 |
| Test (Format Heterogeneity) | 97.09 | 86.55 |
| Test (Corner Cases) | 85.06 | 73.48 |
| Test (Random Columns) | 90.89 | 85.36 |
Error Sampling
Following the statistical analysis of our results, we delved deeper into the data itself to investigate the root causes of the errors. To do this, we sampled prediction errors by schema for both the CTA and CPA tasks. Specifically, we sampled 614 errors from a total of 1,533 observations for CTA, and 617 errors from 3,853 observations for CPA. We manually checked these tables and classified the types of errors we found, which are listed in Table 8. The distribution of error types in the samples is illustrated in Figure 9.
| Observed Error Types | Definition | Example |
|---|---|---|
| Actual wrong label | The label given in the table is obviously incorrect. | A 10-digit number is labeled as gtin8 |
| Bad prediction | The data provided by the column and neighboring columns is sufficient, but the prediction is incorrect. | gtin12 is predicted as gtin14 |
| Schema related label | The prediction correctly represents the data but does not match the associated schema. | In the Movie schema, Movie/name is predicted as Book/name |
| Semantic label | The prediction has the same data type or format as the label, but the interpretation of the data was incorrect. | faxNumber is predicted as telephone and vice versa |
| Duplicate label | Two columns in a table have the same values but are labeled differently, leading to data redundancy and potential confusion or inconsistency in the interpretation of the data. In this case, the wrong prediction is usually the label of the other column. | Columns with the same values are labeled as Brand and Organization, respectively |
| Hierarchical label | The prediction may represent a higher or lower level of abstraction than the label, but the semantic meaning is still understandable. | |
| Bad tables | Approximately 80% of the values in the table are either empty strings, "undefined", or erroneous data inputted by humans. | Example given in Figure 8a. |
| Lack of info | The data provided is insufficient in the current and neighboring columns but is contained in columns outside the window. | Example given in Figure 8b. In the CTA task, the target is to predict column 12, while the useful information is included in the first few columns. |




Relative Frequency Definition
In addition to the absolute count of errors, we also aim to investigate the relative frequencies of errors and correct predictions, to determine corner cases. Therefore, we seek to compare the ratio of incorrect predictions to correct predictions based on the same label.
\[RelativeFrequency(P, T) = \frac{P}{T}\] where \(P\) is the number of predictions for a specific (label, prediction) pair, and \(T\) is the total number of predictions for that label. For example, suppose we have a total of 100 predictions made for the 'ProductModel' label, of which 20 were incorrectly predicted as 'IdentifierAT' and 60 were correctly predicted as 'ProductModel'. Then the Relative Frequency of the error pair ('ProductModel', 'IdentifierAT') is 0.2 and the Relative Frequency of the correct pair ('ProductModel', 'ProductModel') is 0.6. We calculated these frequencies for all predictions and list the 5 pairs with the highest error frequency in Table 9a for CTA and Table 9b for CPA. We also cross-checked those pairs against the samples mentioned above to determine the error types of each pair. Pairs might have multiple possible error types depending on the table being considered.
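A minimal sketch of this computation, reproducing the worked example above:

```python
from collections import Counter

def relative_frequencies(labels, predictions):
    """Relative frequency of each (label, prediction) pair as defined above:
    pair count divided by the total number of predictions for that label."""
    pair_counts = Counter(zip(labels, predictions))
    label_totals = Counter(labels)
    return {pair: count / label_totals[pair[0]] for pair, count in pair_counts.items()}

# Worked example from the text: 100 'ProductModel' predictions,
# 20 predicted as 'IdentifierAT', 60 predicted correctly.
labels = ["ProductModel"] * 100
preds = ["IdentifierAT"] * 20 + ["ProductModel"] * 60 + ["Other"] * 20
freqs = relative_frequencies(labels, preds)
print(freqs[("ProductModel", "IdentifierAT")])   # 0.2
print(freqs[("ProductModel", "ProductModel")])   # 0.6
```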
Error Frequency for CTA
Upon analyzing CTA's top 5 errors, as shown in Table 9a, we observed that the majority were related to hierarchical labels, which is consistent with the sampled data. We believe that refining the 2-Step model could address this issue. Despite the model's overall effectiveness in capturing the meaning of predicted values, inaccuracies persisted, particularly for the addressRegion and addressLocality pair. We discovered that while the schema defines addressRegion as a region of a country, addressLocality corresponds to a city within that region. Unfortunately, we encountered numerous inconsistent labels and mixed-up terms in the tables. Therefore, future work should focus on preprocessing that handles such inconsistent labels.
| Label | Prediction | Pair count | Total count | Error Frequency | Correct Frequency | Possible error types |
|---|---|---|---|---|---|---|
| MusicRecording | MusicRecording/name | 4 | 14 | 0.286 | 0.571 | Hierarchical label |
| ProductModel | IdentifierAT | 60 | 268 | 0.224 | 0.496 | Hierarchical label |
| faxNumber | telephone | 34 | 172 | 0.198 | 0.791 | Semantic label |
| addressRegion | addressLocality | 43 | 221 | 0.195 | 0.738 | Duplicate label, Bad prediction, Actual wrong label |
| MusicGroup | MusicArtistAT | 2 | 11 | 0.182 | 0.818 | Hierarchical label |
Error Frequency for CPA
In CPA's top 5, shown in Table 9b, the most common error type was related to semantic labeling, which highlights the challenges in accurately identifying and labeling the meaning of data values. Consistent data formats further complicate this issue, making it difficult even for humans to distinguish between them without contextual information. To address this issue, we propose including additional relevant columns, such as information about the stadium, to disambiguate between pairs of labels like awayTeam and homeTeam, which can potentially improve the accuracy of the model's predictions.
| Label | Prediction | Pair count | Total count | Error Frequency | Correct Frequency | Possible error types |
|---|---|---|---|---|---|---|
| awayTeam | homeTeam | 10 | 16 | 0.625 | 0.375 | Semantic label |
| ratingCount | reviewCount | 65 | 199 | 0.327 | 0.568 | Semantic label |
| height | width | 57 | 192 | 0.297 | 0.464 | Semantic label |
| productID | sku | 42 | 154 | 0.273 | 0.26 | Semantic label, Duplicate label |
| dateCreated | datePublished | 62 | 249 | 0.249 | 0.558 | Semantic label |
6. Discussion and Conclusion
This chapter is dedicated to the discussion of the findings of the project. We then conclude the project and provide potential implications for future research in the field of table annotation.
6.1 Discussion
We draw inferences on the discrepancy of results between the SOTAB and WikiTables datasets, followed by our analysis of why Column-TaBERT Serialization performed worse than its counterpart, TaBERT Serialization.
WikiTable dataset
In Table 3a and Table 3b, we can see that the best serialization differs between the SOTAB and WikiTables datasets. One reason for this discrepancy could be that the WikiTables dataset contains more tables with fewer columns, such as tables having only 2 columns. The difference is significant, as shown in Table 1. This could cause the Neighboring Column method to perform worse and result in a lower F1 score on the test set.
Column-TaBERT
We hypothesized that the Column-TaBERT method would be able to encode more information within the same token limit as TaBERT and thus produce better results. However, the results indicate that the column data type performed worse than the cell data type. This may be because the structural design did not effectively support the attention mechanism. In TaBERT, the model has one data type token placed just before each cell value. In contrast, Column-TaBERT includes the data type only once and structures it like a normal cell value. Future experiments could explore ways to improve attention by altering this serialization design.
6.2 Conclusion
In conclusion, we have achieved significant improvements over current methods for both the CTA and CPA tasks on the SOTAB benchmark, and comparable results on the WikiTables dataset. We conclude that, for a Transformer-based language model, structure in the input sequence, for both column serialization and row serialization, plays a significant role in the attention mechanism. With large training data like the SOTAB benchmark, preprocessing and data augmentation methods are apparently less useful. However, relevant experiments should be conducted when dealing with smaller datasets.
Moving forward, we were not able to fully utilize the potential of Longformer with longer input sequences. Additionally, due to time constraints, we did not further develop specific data augmentation methods for the different promising serialization methods. Lastly, optimizations could be made to increase the training speed, such as increasing the batch size, model checkpointing, and parallelism. On average, one epoch of training a RoBERTa model on the SOTAB benchmark requires one hour on our machines, and five times as long on the WikiTables dataset.
As table annotation using deep learning is still in its infancy, future work may include Longformer, contrastive learning, or large language models like ChatGPT.
6.3 Acknowledgments
The authors acknowledge support by the state of Baden-Württemberg through bwHPC.
7. References
[1] Korini, K., Peeters, R., & Bizer, C. (2022). SOTAB: The WDC Schema.org Table Annotation Benchmark. SemTab@ISWC.
[2] Ritze, D., Lehmberg, O., Oulabi, Y., & Bizer, C. (2016). Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases. Proceedings of the 25th International Conference on World Wide Web.
[3] Rahm, E., & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.
[4] Deng, X., Sun, H., Lees, A., Wu, Y., & Yu, C. (2020). TURL: Table Understanding through Representation Learning. ArXiv, abs/2006.14806.
[5] Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. ArXiv, abs/1706.03762.
[6] Hulsebos, M., Demiralp, C., & Groth, P. (2021). GitTables: A Large-Scale Corpus of Relational Tables. ArXiv, abs/2106.07258.
[7] Suhara, Y., Li, J., Li, Y., Zhang, D., Demiralp, C., Chen, C., & Tan, W.C. (2021). Annotating Columns with Pre-trained Language Models. Proceedings of the 2022 International Conference on Management of Data.
[8] Wang, Z., Dong, H., Jia, R., Li, J., Fu, Z., Han, S., & Zhang, D. (2020). TUTA: Tree-based Transformers for Generally Structured Table Pre-training. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
[9] Chen, J., Jiménez-Ruiz, E., Horrocks, I., & Sutton, C. (2019). Learning Semantic Annotations for Tabular Data. ArXiv, abs/1906.00781.
[10] Khurana, U., & Galhotra, S. (2021). Semantic Concept Annotation for Tabular Data. Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
[11] Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, C., & Tan, W.C. (2019). Sato: Contextual Semantic Type Detection in Tables. Proc. VLDB Endow., 13, 1835-1848.
[12] Brugman, S. pandas-profiling: Exploratory Data Analysis for Python.
[13] Beltagy, I., Peters, M.E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. ArXiv, abs/2004.05150.
[14] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
[15] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.
[16] Nguyen, P., Kertkeidkachorn, N., Ichise, R., & Takeda, H. (2020). MTab4DBpedia: Semantic Annotation for Tabular Data with DBpedia.
[17] Lin, T., Wang, Y., Liu, X., & Qiu, X. (2021). A Survey of Transformers. AI Open, 3, 111-132.
[18] Bhagavatula, C., Noraset, T., & Downey, D. (2015). TabEL: Entity Linking in Web Tables. International Workshop on the Semantic Web.
[19] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., & Liu, Q. (2019). TinyBERT: Distilling BERT for Natural Language Understanding. Findings.
[20] Yin, P., Neubig, G., Yih, W., & Riedel, S. (2020). TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. ArXiv, abs/2005.08314.
[21] Google (2015). Freebase Data Dumps.