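Patent title | Systems and methods of directionally guided, discriminate crawling of internet real estate listings | ||
Application number | AU2007100279 | Filing date | |
Publication number | AU2007100279A4 | Publication date | |
Applicant (patentee) | BREEZ BRANDER | Inventor | Brander Breez |
Patent source | China National Intellectual Property Administration | Transfer method | |
Abstract |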
A web crawler comprising a plurality of crawler modules. Each module thread determines the initial URLs from which information is to be downloaded, and from which the crawl is to be directed further, by retrieving the pre-set entry URLs and their corresponding web pages with real estate content, downloading the documents corresponding to the entry URLs, and processing the documents and web pages. If an entry URL contains real estate listings, the crawler thread filters the web page with pre-set filters, extracts the relevant words and hyperlinks, processes and categorises the filtered data, and inputs it into a database; it then filters the same page for hyperlinks to similar real estate listings pages on the same website and processes those pages in the same way as the initial one. The hyperlink filters utilise unique dynamic dictionary and thesaurus procedures. If the crawler thread does not find any pre-set typical content on the initially crawled web page, it uses the same dictionary and thesaurus procedures to determine whether the page's source code has merely changed or the page really contains no real estate listings data. If this procedure determines that the page does contain real estate listings data and that its source code has changed, the thread updates the filters to remember the new code and then processes the page as before: it filters out the real estate listings words and hyperlinks (data), processes and categorises the data, inputs it into the database, and filters the page for hyperlinks to similar listings pages on the same website, which it processes in the same way as the initial one. If the crawler thread finds no real estate listings data on the initial page after employing both procedures, it filters the page only for hyperlinks whose wording indicates that they lead to a page on this website where real estate listings might be found. 
The crawler thread processes the entire initial website and then follows all outbound hyperlinks (hyperlinks to other websites on different domain names and/or Internet Protocol addresses) that it found on this website and flagged (remembered). The exact same crawl procedure explained above then repeats on this next website and on the next one until the crawler finds no more matching websites to crawl. |
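
The crawl procedure in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the term sets `LISTING_TERMS` and `LINK_TERMS` are hypothetical stand-ins for the dictionary and thesaurus procedures, the two-hit threshold in `classify` is an assumption, `learn_marker` is a crude placeholder for "updating the filters to remember the new code", and `fetch` is a caller-supplied function that returns a page's HTML or `None`. The sketch also assumes every outbound link is flagged and uses naive substring matching.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical vocabularies standing in for the dictionary/thesaurus procedures.
LISTING_TERMS = {"bedroom", "bathroom", "for sale", "for rent", "price", "property"}
LINK_TERMS = {"listing", "property", "properties", "homes", "rent", "sale"}

HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)


def extract_links(base_url, html):
    """Return (absolute_url, anchor_text) pairs found in the page."""
    return [(urljoin(base_url, href), text) for href, text in HREF_RE.findall(html)]


def classify(html, markers):
    """Return 'known', 'changed' or 'none' for a page."""
    if any(marker in html for marker in markers):
        return "known"  # a pre-set filter matched the page source
    text = html.lower()
    hits = sum(term in text for term in LISTING_TERMS)
    # Assumed threshold: two vocabulary hits mean listings with changed markup.
    return "changed" if hits >= 2 else "none"


def learn_marker(html, markers):
    """Placeholder filter update: remember the first class attribute seen."""
    match = re.search(r'class="[^"]+"', html)
    if match:
        markers.add(match.group(0))


def looks_promising(url, anchor_text):
    """True if the words in a hyperlink suggest it leads to listings."""
    haystack = (url + " " + anchor_text).lower()
    return any(term in haystack for term in LINK_TERMS)


def crawl_site(entry_url, fetch, markers):
    """Crawl one website; return (listing_page_urls, flagged_outbound_urls)."""
    host = urlparse(entry_url).netloc
    queue, seen = deque([entry_url]), {entry_url}
    listing_pages, outbound = [], set()
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        verdict = classify(html, markers)
        if verdict == "changed":
            learn_marker(html, markers)
        if verdict != "none":
            listing_pages.append(url)  # a real crawler would extract data here
        for link, text in extract_links(url, html):
            if urlparse(link).netloc != host:
                outbound.add(link)  # flag for the next website
            elif link not in seen and looks_promising(link, text):
                seen.add(link)
                queue.append(link)
    return listing_pages, outbound


def crawl(entry_urls, fetch, markers, max_sites=50):
    """Process each website fully, then follow its flagged outbound links."""
    pending, visited_hosts, results = deque(entry_urls), set(), []
    while pending and len(visited_hosts) < max_sites:
        url = pending.popleft()
        host = urlparse(url).netloc
        if host in visited_hosts:
            continue
        visited_hosts.add(host)
        pages, outbound = crawl_site(url, fetch, markers)
        results.extend(pages)
        pending.extend(sorted(outbound))
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library: `crawl(["http://a.example/"], pages.get, markers)` works against an in-memory dictionary of pages, with `markers` given as a mutable set so that learned filters persist across websites.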