An Approach For Web Data Extraction Based On Similarity Of Templates
Abstract Category: I.T.
Course / Degree: M.Phil. C.Sc
Institution / University: PRIST University, Vallam, Thanjavur, India
Published in: 2013
Large amounts of data are available in large collections of documents drawn from heterogeneous web pages, and web data extraction has become an important component of many web data analysis applications. In this thesis, we formulate the data extraction problem in terms of structured data and tree templates. We propose an unsupervised, page-level data extraction approach that deduces the template of each individual deep website, where data records are contained in a single web page. We use relative path weights to measure the similarity of the underlying template structures in the documents, and we cluster the web documents by that similarity so that a template can be extracted for each cluster. Once a template is identified, the documents are grouped and data extraction is handled through template wrappers. This technique performs better than previous algorithms in terms of both space and time. Our experimental results on real-life data sets confirm the effectiveness and robustness of the technique.
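The abstract does not define how relative path weights are computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes a page is summarized by its root-to-element tag paths, that a path's weight is its share of all paths in the page, that similarity is a weighted Jaccard score, and that clustering is a single greedy pass with a fixed threshold. The names `path_weights`, `similarity`, `cluster` and the threshold value are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Counts the root-to-element tag path of every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing element: count its path, but do not open a new level.
        self.paths["/".join(self.stack + [tag])] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def path_weights(html):
    """Assumed definition: weight of a path = its count / all path counts."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    total = sum(collector.paths.values())
    if total == 0:
        return {}
    return {path: count / total for path, count in collector.paths.items()}


def similarity(a, b):
    """Weighted Jaccard score between two path-weight maps (assumed metric)."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return shared / union if union else 1.0


def cluster(pages, threshold=0.6):
    """Greedy one-pass clustering: a page joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []  # list of (representative weights, member page indices)
    for index, html in enumerate(pages):
        weights = path_weights(html)
        for representative, members in clusters:
            if similarity(representative, weights) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((weights, [index]))
    return [members for _, members in clusters]
```

Under these assumptions, two listing pages built from the same table layout but holding different numbers of records would be expected to land in one cluster, while a page built from a different layout starts its own. A per-cluster template wrapper could then be derived from the paths the members share; that step is not sketched here because the abstract gives no detail on it.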
Thesis Keywords/Search Tags:
web data extraction, template similarity
This Thesis Abstract may be cited as follows:
Thenmozhi Murugesan, 2013
Submission Details: Thesis Abstract submitted by thenmozhi murugesh from India on 05-Sep-2013 10:55.
thenmozhi murugesh Contact Details: Email: thenmozhi.murugesh@gmail.com