Semantic Information Extraction From Multi-Modal Technical Documents
Industrial standards such as OPC UA and the IEC and IEEE standards are at the core of industrial automation because they enable an interoperable industrial ecosystem. These standards describe semantics that machines claiming to follow them should implement. However, the semantics are defined in textual form within the standard, which makes it challenging to verify that a machine description claiming compliance actually conforms to them. This hinders interoperability between machines from different vendors, even though, in the context of Industry 4.0, such interoperability and the open ecosystem it enables are crucial for industrial automation. There is therefore an urgent need to extract and formalize the semantics of industrial standards so that machine descriptions can be verified for compliance against them. Currently, the state of the art describes no comprehensive method for addressing this issue. In this thesis, we investigate how to (semi-)automatically extract and formalize semantics from the technical specifications of industrial standards using Semantic Web Technologies (SWT), Natural Language Processing (NLP), Machine Learning (ML), and knowledge-based information extraction methods. We focus mainly on (i) Named Entity Recognition on the text of the technical documents; (ii) binary and multi-class classification of rules (constraints) using a limited amount of annotated, domain-specific textual data; (iii) information extraction from tables; and (iv) finding the contextual similarity between information in the text and in the tables of the same document. We use OPC UA Companion Specifications as a use case.