|
What is 1000Projects
'1000projects.com' is an educational content website dedicated to finding and realizing Final Year Projects, IEEE Projects, Engineering Projects, Science Fair Projects, Project Topics, Project Ideas, Major Projects, Mini Projects, Paper Presentations, Presentation Topics, IEEE Topics, .Net Projects, Java Projects, PHP Projects, VB Projects, SQL Projects, C & DS Projects, C++ Projects, Perl Projects, ASP Projects, Delphi Projects, HTML Projects, Cold Fusion Projects, Java Script Projects, Btech Projects, BE Projects, MCA Projects, Mtech Projects, MBA Projects, Project on Software, CBSE Projects, Testing Projects, Embedded Projects, Chemistry Projects, Electronics Projects, Electrical Projects, Science Projects, Mechanical Projects, Mba project Reports, Placement papers, Sample Resumes, Entrance Exams, Technical Faq's, Puzzles, etc
how it works?
Everything on this site is submitted by the students in this professional community. You Can submit your Projects, Project Topics & Ideas to info.1000projects{at}gmail.com after you submit your project/project Idea/Abstract/Seminar Topics, These are being verified and approved by our administrator. after approval of this project/project Idea/Abstract/Seminar Topics, It can be shown on 1000projects.com so that other users can read/discuss it.The entire content on this website is Only For Educational Purpose, Non Commercial use!
Please help us/Other Users by sending projects/project Ideas/Abstracts/Seminar Topics. Thanking You!!!!!
|
Automatic Identification of Informative Sections of Web Pages
|
|
|
Automatic Identification of Informative Sections of Web Pages
Abstract ::
Abstract: Web pages - especially dynamically generated ones - contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content, and largely do not seek the noninformative content. A tool that assists an end-user or application to search and process information from Web pages automatically, must separate the "primary content sections" from the other content sections. We call these sections as "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers, trained with block-features, respectively. While operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
|
|
|
|