Description
It is said that the world's knowledge is on the Internet. Scientific knowledge resides in books, journals, and conference proceedings. Coping with this huge amount of information requires clever algorithms that filter, sort, and ultimately mine it, improving as they receive more data. A common technique is to mine the text of publications. But publications contain more information than their text alone. The position of a word gives clues about its meaning. Images either supplement the text or offer proof of a proposition. Tables form semantic units only when read in rows and columns. To make use of this additional information, classic text mining techniques have to be coupled with spatial data and image data. For this thesis, a framework was developed that allows the analysis of layout information in scientific documents. The framework was used in three case studies. The first automatically extracts images together with their annotation in the paper. The second refines that approach by further classifying the images into semantic categories based on their content. The third examines the use of tables in this context. All three discover knowledge that would not have been visible through classical text mining and provide hard evidence for the hypothesis that using layout information does indeed improve the possibilities of text mining.