Comments on DigitalPebble's Blog: Towards Nutch 2.0

Julien Nioche, 2010-12-02 12:43:

Re MySQL: I would not jump to any conclusion too quickly before comparing with GORA+HBase first. It could easily be a problem with the way things are done in Nutch 2.0, or even with the implementation of the MySQL backend in GORA. It would be interesting to hear about your experiences with GORA+HBase versus GORA+MySQL versus Nutch 1.2.

Alexis, 2010-12-02 03:23:

Sorry for the late answer.

I will definitely try Nutch 2.0 with HBase as the datastore from now on.
As regards the migration utility, I can give it a shot.

Regarding the MySQL issue, this sounds like an I/O issue? A coworker told me MySQL does not support non-blocking I/O, whatever that means; the veracity of this is yet to be verified. The thread that requests data from the MySQL server might block on every single query and prevent any other query from running simultaneously, unless you use a different connection.

Probably a way to speed up the connect, read and write operations would be to set up the MySQL database locally. But this seems pretty incompatible with the distributed nature of a Hadoop job.

If we have to stick with a remote server, another way would be to use a pool of connections.

Julien Nioche, 2010-11-12 12:37:

Re saving the fetch step: this could definitely be done as well, but it would require writing a bit of code for converting the 1.x segments to 2.0. This would be a nice contribution BTW ;-)

As for the speed problem, it's not so much that Nutch 2.0+MySQL is slower; the problem is that for some reason it does not get as many URLs as 1.2. It could be a problem with the MySQL backend in GORA, and it would be worth testing it with HBase instead.

2.0 is definitely under development, but why don't you give it a try anyway? Testing and reporting issues is definitely a way of getting it up to speed.

Alexis, 2010-11-10 18:52:

Thanks for the prompt reply!

Nutch 2.0 is currently (much too) slow compared with the 1.2 version, according to the issue.
I was interested in migrating the crawldb and the segments to the datastore, not only to avoid rediscovering the URLs but also to save the fetch step, which takes the most time.
I intended to reload all the data generated by my previous generate/fetch/update iterations...

I guess I'll stick with 1.2 for now since 2.0 is apparently still under development.

Julien Nioche, 2010-11-09 09:29:

Hi Alexis,

There aren't any tools for migrating to 2.0 yet, but it wouldn't be too difficult to write one.
What you can do already is get a list of the URLs in your 1.2 crawldb and inject that into 2.0. You'd have to refetch these URLs of course, but at least you wouldn't have to rediscover them.
Please note that 2.0 is at an early stage and has some open issues such as https://issues.apache.org/jira/browse/NUTCH-879. However, it is worth playing with anyway (and reporting bugs if you find any).

Alexis, 2010-11-08 21:49:

I have created a few segments with Nutch 1.2. Are there any tools that would let me migrate this crawl data to Nutch 2.0?
Would it be hard to create such a tool?
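
The workaround suggested in the thread (listing the URLs in the 1.2 crawldb and injecting them into 2.0) could look roughly like the sketch below. It is not a tool from the Nutch project. It assumes the crawldb was first dumped to text with something like `bin/nutch readdb crawldb -dump dumpdir`, and that each record in the dump starts with a line of the form `<url><TAB>Version: <n>`; check both against your own 1.2 installation. The names `extract_urls` and `write_seed_file` are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: each record in the readdb text dump starts with
# "<url>\tVersion: <n>", followed by "Status: ...", "Fetch time: ...", etc.
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+\s*$")


def extract_urls(lines):
    """Yield the URL of every record header found in the dump lines."""
    for line in lines:
        match = RECORD_START.match(line)
        if match:
            yield match.group(1)


def write_seed_file(dump_path, seed_path):
    """Write one URL per line to seed_path and return how many were written."""
    count = 0
    with open(dump_path, encoding="utf-8") as dump, \
            open(seed_path, "w", encoding="utf-8") as seed:
        for url in extract_urls(dump):
            seed.write(url + "\n")
            count += 1
    return count
```

The resulting file can be placed in a seed directory and passed to the 2.0 inject step (for example `bin/nutch inject urls`, again to be checked against your version). As noted in the thread, the URLs still have to be refetched; this only avoids rediscovering them.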