tag:blogger.com,1999:blog-65402890768587851392024-03-19T09:16:35.692+00:00DigitalPebble's BlogDigitalPebble Ltd is a consulting company specialised in linguistic engineering, document management, information retrieval and extraction. Our expertise is based on open source solutions, such as Lucene, SOLR, Nutch or Gate.Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.comBlogger73125tag:blogger.com,1999:blog-6540289076858785139.post-46318797613543315192023-11-23T16:10:00.004+00:002023-11-24T09:34:16.671+00:00Meet the StormCrawler users: Q&A with the OpenWebSearch.eu project<p></p><p style="text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">It has been a while since our first “</span><span style="font-style: italic; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">Meet the StormCrawler users”</span><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"> blog and since StormCrawler is still going strong and used by a wide variety of users, we are delighted to put the spotlight on one of the most exciting projects that uses it. Our guests today are Michael Dinzinger and Saber Zerhoudi, both from the University of Passau in Germany.
</span></span></p><p></p><span id="docs-internal-guid-22cf4e56-7fff-63b4-3a36-d0d0345998db"><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: center;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="border: none; display: inline-block; height: 179px; overflow: hidden; width: 179px;"><img height="161" src="https://lh7-us.googleusercontent.com/Xs2ygmZ593zcKZ5KvwvOKCM5qBg96uAPuJW10T76kYldH44-4t7ouP7O9lPDEtQsqUkep51Jp1NtIIr2MKGn07Kxwlq8yAbFXjGUsqYAnvHRDHc4p-hnGE895kjIErAiZnH9xrRI4GXOEstSC_98W3w=w161-h161" style="margin-left: 0px; margin-top: 0px;" width="161" /></span></span> <span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="border: none; display: inline-block; height: 179px; overflow: hidden; width: 179px;"><img height="160" src="https://lh7-us.googleusercontent.com/tXybrHPOxJS4doWpnZAdvIg8qLy5KtDU_-V9bfA0y5L9-9yZC81lvdAlclTp0Y6KEtx8m2JFPmm8tzfHNhRZ0oT3KYhN5j4Os4K2FyRcqKRaiydQ_FCQXcuNpHfA8E4_tA1JLMfsNIcP59MwFE1T6HM=w160-h160" style="margin-left: 0px; margin-top: 0px;" width="160" /></span></span></span></p></span><h2 style="text-align: left;"><span><b style="font-family: arial; white-space-collapse: preserve;"><span style="font-size: medium;">Can you please introduce yourselves and the project you are working on?</span></b></span></h2><span><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">Hello, 
we are Saber and Michael, both PhD students in Passau. Since September 2022, we have been working on </span><a href="https://openwebsearch.eu/" style="text-decoration-line: none;"><span style="color: #1155cc; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space-collapse: preserve;">OpenWebSearch.eu</span></a><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">, a European research project, in which people from more than 15 participating institutes now collaborate on building an <i>Open Web Index</i>.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">Our task here at </span><a href="https://www.uni-passau.de/" style="text-decoration-line: none;"><span style="color: #1155cc; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space-collapse: preserve;">Uni Passau</span></a><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"> is collaborative and resource-efficient crawling, which is the first technical step in building the Index (see figure below). The end results are Metadata and Index files, currently in Parquet and CIFF format. 
These are hosted on the project partners’ shared infrastructure and will soon be available for download.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: center;"><img height="194" src="https://lh7-us.googleusercontent.com/eb-4Zx5ACUGm5nIu9rO_50unT2DoNzSslWEqX8sglCM4kPZLBW7RqMcbI-uivqr1k24zdP6DCkCYyALV-kBKM_5cM8TkejOZh_gMErjnRXmpmldAwaxc3wgg8pJOcWiu4Dg06JQt3S0mvpXfPPWZAiA=w400-h194" style="font-family: arial; margin-left: 0px; margin-top: 0px; text-align: center; white-space-collapse: preserve;" title="Figure 1 Open Web Search Pipeline" width="400" /></p><div style="text-align: justify;"><span style="font-family: arial; text-align: left; white-space-collapse: preserve;"> By providing these files to our users, we want to empower them to work on new search applications and tap the web as a resource for their research and business ideas. The <i style="font-weight: 400;">Open Web Index </i><span style="font-weight: 400;">is in this sense a truly open, transparent and legally compliant alternative to the proprietary Web Indices of the big tech gatekeepers.</span></span></div><h2 style="text-align: left;"><span style="font-family: arial; font-size: medium; text-align: left; white-space-collapse: preserve;">How do you use StormCrawler and URLFrontier?</span></h2></span><span><p style="text-align: justify;"><span style="font-family: arial; white-space-collapse: preserve;">We use </span><a href="http://stormcrawler.net/" style="font-family: arial; white-space-collapse: preserve;" target="_blank">StormCrawler</a><span style="font-family: arial; white-space-collapse: preserve;"> to build our own crawling pipelines by configuring - and in some cases extending - the already existing software components. 
We particularly appreciate its high customizability, because we use the framework for classic discovery crawling, which we need to feed the </span><i style="font-family: arial; white-space-collapse: preserve;">Open Web Index,</i><span style="font-family: arial; white-space-collapse: preserve;"> and also for more task-specific and research-oriented crawling.</span></p><p></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: justify;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">A major challenge in our work is the heterogeneous infrastructure, on top of which we are building the crawling system. The different infrastructure partners in the project provide a large set of commodity hardware, which is hosted across different datacenters and dispersed over Europe. Despite the geographic distribution of the machines, all nodes should collaborate on the same shared crawl. For that purpose, we deploy <a href="http://urlfrontier.net/" target="_blank">URLFrontier</a> in a central computing site. The Frontier services distribute the crawl space and communicate with the remote crawlers in order to provide them with a continuous flow of URLs to be fetched (see figure below). URLFrontier can use different backends to store the data; we chose one that leverages another open source project, <a href="http://opensearch.net/">OpenSearch</a>. 
</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: center;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="border: none; display: inline-block; height: 287px; overflow: hidden; width: 328px;"><span style="font-family: arial;"><img height="287" src="https://lh7-us.googleusercontent.com/g0zX4-Z_KqsqnE5dAl7IbdQSx58RogZrfTxEKoaF83szzrIEfdFNjH9uP6SApLN5O4Vp_sIBD7iNb7b6ky-KSfoqbmJFmGrxqvdf0bPFpesuplWN-ZjZBIK-e_JF9Av4SW7G80WH2cs1CJeX7l-tFXE" style="margin-left: 0px; margin-top: 0px;" width="328" /></span></span></span></p></span><h2 style="text-align: left;"><span style="font-family: arial; white-space-collapse: preserve;"><span style="font-size: medium;">What results did you get so far?</span></span></h2><span><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: justify;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">The crawling is currently still in its experimental phase, but fortunately, we have already achieved some interesting and promising numbers. For example, we are running three StormCrawler instances at the moment. These have fetched over 200M web pages within a single week and each of them produced between 200 and 250 GiB of WARC files per day. The crawled data is filtered and enriched with meta information, before it is provided as index and metadata files to the public. 
In the next steps, we want to upscale the crawling to several terabytes and improve the prioritisation of crawl URLs to get a strong focus on high-quality pages.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">It is definitely worth mentioning that the WARC module of StormCrawler helped us a lot. In order to get our indexing pipeline going, we started with copying WARC files from </span><a href="https://commoncrawl.org/" style="text-decoration-line: none;"><span style="color: #1155cc; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space-collapse: preserve;">CommonCrawl</span></a><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">, before we were able to crawl on our own.</span></span></p></span><h2 style="text-align: left;"><span style="font-family: arial; white-space-collapse: preserve;"><span style="font-size: medium;">Why did you choose StormCrawler?</span></span></h2><span><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">We chose StormCrawler primarily for its compatibility with URLFrontier. 
This synergy made it an excellent starting point for developing a large-scale, coordinated, and distributed crawling cluster. Additionally, the open-source nature of the project and its active community influenced our decision. It was crucial for us to be supported by a network of developers who continuously enhance the core software and provide assistance or solutions when needed.</span></span></p></span><h2 style="text-align: left;"><span style="font-size: medium;"><span style="font-family: arial; white-space-collapse: preserve;">Did you make any contributions to it? Any advice you could give to future users and contributors?</span></span></h2><span><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">Yes, we have contributed to StormCrawler by creating a forked version named </span><a href="https://openwebsearch.eu/owler/" style="text-decoration-line: none;"><span style="color: #1155cc; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space-collapse: preserve;">OWLer</span></a><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">.
</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: arial;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;">This version includes several improvements and additions we deemed necessary for our project. We've implemented extended topologies for various purposes and added a classification component to categorise and annotate URLs based on either just the URL or the URL plus website content. It serves as a labelling tool for the crawler's content. </span></span></p><span style="font-family: arial;"><br /></span><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">URLFrontier has also been expanded to accommodate these modifications, enabling crawlers to specialise in topics, languages, genres, etc. </span></span></p><span style="font-family: arial;"><br /></span><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">Moreover, we have introduced a "Crawling-On-Demand" service. Users can register their requests on the new OWLer webpage by specifying a list of seed URLs and additional information. Upon submission, a StormCrawler instance is deployed in our infrastructure, fetching and storing the content as WARC files in a dedicated S3 bucket. Once the crawl has completed, users receive a link to download the WARC files via email. 
URLFrontier tracks the progress of these crawls.</span></span></p></span><h2 style="line-height: 1.38; margin-bottom: 14pt; margin-top: 14pt; text-align: left; white-space: pre;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; text-wrap: wrap; vertical-align: baseline;"><span style="font-family: arial; font-size: medium;">What's next?</span></span></h2><span><p style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-variant-position: normal; vertical-align: baseline; white-space-collapse: preserve;"><span style="font-family: arial;">We are currently expanding our "Crawling-On-Demand" service to include "Indexing-On-Demand." Users will be able to specify a list of seed URLs and additional tags. We will then search our database of previously crawled and processed URLs for recent content matching this list and provide it to the user in an indexed format.
<span style="font-size: medium;">
</span></span></span></p><p style="color: #222222; text-align: left;"><span style="font-family: arial; font-size: medium;">LinkedIn: <a href="https://www.linkedin.com/company/openwebsearch-eu" target="_blank">openwebsearch-eu</a><br />X: <a href="https://twitter.com/OpenWebSearchEU" target="_blank">OpenWebSearchEU</a><br />Mastodon: <a href="https://suma-ev.social/@openwebsearcheu">@openwebsearcheu@suma-ev.<wbr></wbr>social</a></span></p></span>Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-26759008018637012852023-10-26T07:40:00.003+01:002023-10-26T07:40:31.930+01:00Focus on protocol improvements in StormCrawler 2.10<p><span style="font-size: medium;"><span style="font-family: arial;"><a href="https://github.com/DigitalPebble/storm-crawler/releases/tag/2.10">StormCrawler 2.10</a> was released yesterday and, as usual, it contains loads of improvements, dependency upgrades and bug fixes. Instead of going through each one of them, we will focus specifically on what was done for <b>protocols</b>.<br /></span><span style="font-family: arial;"><br />First, every protocol implementation can now easily be tested on the command line, even <i>FileProtocol</i> or <i>DelegatorProtocol</i> thanks to <a href="https://github.com/DigitalPebble/storm-crawler/issues/1097">#1097</a>. 
For instance, </span></span></p><p><span style="color: #666666; font-family: courier;">storm local target/xxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol -f crawler-conf.yaml https://storm.apache.org/ -b</span></p><div><span style="font-family: arial; font-size: medium;">which configures the RemoteDriverProtocol with the content of <i>crawler-conf.yaml</i> and displays info on the console.<br /><br /><div style="text-align: justify;"><span style="font-family: arial; font-size: medium;">You might have noticed that the option to specify a configuration file has changed from -c to <b>-f</b> as the former conflicted with a Storm operator. </span><span style="font-size: medium;"><span style="font-family: arial;">We also added an option</span></span></div></span></div><div style="text-align: justify;"><span style="font-size: medium;"><span style="font-family: arial;"><b>-b </b></span></span><span style="font-family: arial; font-size: medium;">which dumps the content of the URL to a file in the temp folder, making it very easy to check what the protocol actually retrieved for a given configuration.</span><br /><br /><span style="font-family: arial; font-size: medium;">Using the command above in combination with debugging is particularly powerful. This can easily be done with </span><br /><br /><div style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">export STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"</span></div></div><p style="text-align: justify;"><span style="font-size: medium;"><span style="font-family: arial;">One of the main changes concerns the <b>Selenium</b> module. It had been a while since it had any work done to it and its configuration was pretty obsolete. With <a href="https://github.com/DigitalPebble/storm-crawler/pull/1093">#1093</a>, we added some much-needed unit tests and removed some incompatible configuration. 
</span></span><span style="font-family: arial; font-size: medium;"><a href="https://github.com/DigitalPebble/storm-crawler/issues/1100">#1100</a> added an option to deactivate tracing and we fixed the user agent substitution (<a href="https://github.com/DigitalPebble/storm-crawler/issues/1109">#1109</a>).<br /><br />An important and incompatible change in the Selenium module concerns the way the timeouts are configured. The previous mechanism was opaque and error-prone. This has been replaced in <a href="https://github.com/DigitalPebble/storm-crawler/issues/1101">#1101</a>: the timeouts are now configured with a map</span></p><p style="text-align: justify;"><span style="color: #666666;"><span style="font-family: courier; font-size: x-small;"> selenium.timeouts:<br /></span><span style="font-family: courier; font-size: small;"> script: -1<br /></span><span style="font-family: courier; font-size: small;"> pageLoad: -1<br /></span><span style="font-family: courier; font-size: small;"> implicit: -1</span></span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: large;">with -1 preserving the Selenium default values.<br /></span><br /><span style="font-family: arial; font-size: medium;">The <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/protocol/DelegatorProtocol.java">DelegatorProtocol</a> has also been greatly improved. If you are not familiar with it, it allows you to determine which protocol implementation should be used for a URL given the metadata it has. 
For instance, </span></p><pre class="notranslate" style="border-radius: 6px; box-sizing: border-box; font-family: ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px; line-height: 1.45; margin-bottom: 16px; margin-top: 0px; overflow-wrap: normal; overflow: auto; padding: 16px;"><code class="notranslate" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 6px; border: 0px; box-sizing: border-box; display: inline; font-family: ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation Mono", monospace; font-size: 11.9px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;"><span style="color: #666666;"> # use the normal protocol for sitemaps
protocol.delegator.config:
- className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
filters:
isSitemap: "true"
- className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"</span></code></pre><p style="text-align: justify;"><span style="font-family: arial; font-size: large;">will use the OkHttp protocol for a URL if it has a key <i>isSitemap</i> in its metadata with a value of <i>true</i>. Otherwise it will use the Selenium implementation.<br /><br />With <a href="https://github.com/DigitalPebble/storm-crawler/issues/1098" target="_blank">#1098</a>, we added an operator indicating whether the conditions should be treated as an AND or OR. We also added the possibility to triage based on regular expressions on the URL itself (<a href="https://github.com/DigitalPebble/storm-crawler/issues/1110">#1110</a>). <br /><br />You can now express more complex configurations such as <br /></span></p><div><span style="color: #666666; font-family: courier; font-size: x-small;"> # use the normal protocol for sitemaps, robots and if asked explicitly</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> protocol.delegator.config:</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> operator: OR</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> filters:</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> isSitemap: "true"</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> robots.txt:</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> skipSelenium:</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> regex:</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> - \.pdf</span></div><div><span style="color: #666666; 
font-family: courier; font-size: x-small;"> - \.doc</span></div><div><span style="color: #666666; font-family: courier; font-size: x-small;"> - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"</span></div><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;">As a result, we removed the deprecated class <i>DelegatorRemoteDriverProtocol.</i><br /><br />The DelegatorProtocol is of course particularly useful for avoiding sending URLs to the Selenium implementation unnecessarily, as illustrated above.</span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;"><br />StormCrawler 2.10 contains of course other changes and dependency updates and, as usual, we recommend that you switch to it. As we have seen today, the improvements we added to make protocol implementations easier to test and configure should be a reason to upgrade.<br /><br />We would like to thank all the users and contributors to the 2.10 release.<br /><br />Happy crawling!</span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;"><br /></span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;"><br /></span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;"><br /></span></p><p style="text-align: justify;"><span style="font-family: arial; font-size: medium;"> </span><br /><br /></p>Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-50344482484092763172022-03-22T10:00:00.004+00:002022-03-22T10:00:40.711+00:00What's new in StormCrawler 2.3<div style="text-align: left;"><span style="font-family: arial;"><a href="https://github.com/DigitalPebble/storm-crawler/releases/tag/2.3">StormCrawler 2.3</a> was released yesterday. 
It contains a relatively small number of changes compared to previous releases but these include important bug fixes. We have also ported existing ParseFilters to JSoupParseFilters, leading to some noticeable performance improvements and an exuberant tweet<br /><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">So much for the "no-coding-Friday" but this is a bomb<a href="https://t.co/8VIyChtelK">https://t.co/8VIyChtelK</a><br /><br />If the parsing bolt in your <a href="https://twitter.com/hashtag/StormCrawler?src=hash&ref_src=twsrc%5Etfw">#StormCrawler</a> topology is a bit slow, you should definitely have a look at this one.<br /><br />(and a big thank you to <a href="https://t.co/UGUVdHl0W1">https://t.co/UGUVdHl0W1</a>)</p>— DigitalPebble (@digitalpebble) <a href="https://twitter.com/digitalpebble/status/1494711741808361475?ref_src=twsrc%5Etfw">February 18, 2022</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />We also welcomed <span style="font-size: 14.6667px; text-align: justify; white-space: pre-wrap;">Richard Zowalla as a new committer on the project.</span><br /><br />Here are the main changes.</span></div><h2 style="text-align: left;"><span style="font-family: arial;">Dependency upgrades</span></h2><p></p><ul style="text-align: left;"><li><span style="font-family: arial;">Elasticsearch 7.17.0 </span></li><li><span style="font-family: arial;">Tika 2.3.0</span></li><li><span style="font-family: arial;">Caffeine 2.9.3 </span></li></ul><p></p><h2 style="text-align: left;"><span style="font-family: arial;">Core</span></h2><p></p><ul style="text-align: left;"><li><span style="font-family: arial;">Convert LinkParseFilter into a JSoupFilter (<a href="https://github.com/DigitalPebble/storm-crawler/issues/944">#944</a>)<br /></span></li><li><p><span style="font-family: arial;">Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters (<a 
href="https://github.com/DigitalPebble/storm-crawler/pull/953">#953</a>)</span></p></li><li><p><span style="font-family: arial;">General Code Refactoring and Good Practices (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/937" style="font-family: arial;">#937</a><span style="font-family: arial;">)</span></p></li><li><p><span style="font-family: arial;">Add unified way of initializing classes via string … (<a href="https://github.com/DigitalPebble/storm-crawler/pull/943">#943</a>)<br /></span></p></li><li><p><span style="font-family: arial;">Changed order of emit outlinks and emit of parent url ... (<a href="https://github.com/DigitalPebble/storm-crawler/issues/954">#954</a>) </span></p></li></ul><h2><span style="font-family: arial;">Elasticsearch </span></h2><p></p><ul style="text-align: left;"><li><span style="font-family: arial;">Enable compression (<a href="https://github.com/DigitalPebble/storm-crawler/issues/941">#941</a>) </span></li><li><p><span style="font-family: arial;">Enable _source for content index in ES archetype (<a href="https://github.com/DigitalPebble/storm-crawler/issues/958">#958</a>) </span></p></li></ul><h2 style="text-align: left;"><span style="font-family: arial;">URLFrontier</span></h2><div style="text-align: left;"><ul style="text-align: left;"><li><span style="font-family: arial;">Spout does not reconnect to URLFrontier if an exception occurs (<a href="https://github.com/DigitalPebble/storm-crawler/issues/956">#956</a>) </span></li></ul></div><p><span style="font-family: arial;">The next release will probably include a new module for Elasticsearch 8, see <a href="https://github.com/DigitalPebble/storm-crawler/issues/945">#945</a>. 
If you have some experience of using the new ES client library, your contribution will be very welcome.<br /><br />Thank you to all users and contributors, in particular Felix Engl for his work on the code refactoring and Julian Alvarez for reporting and fixing the bug in <a href="https://github.com/DigitalPebble/storm-crawler/issues/954">#954</a>.<br /><br />Our users Gage Piracy have also been very generous in donating some of the customisations we wrote for them back to the project.<br /><br />Happy crawling!</span></p>Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-71140225978921500532022-03-21T13:59:00.006+00:002022-03-21T14:00:20.689+00:00Unlock your web crawl with URLFrontier<p> <span style="font-family: Arial; font-size: 11pt; text-align: justify; white-space: pre-wrap;">Our guest writer today is Richard Zowalla. </span></p><span id="docs-internal-guid-0c9f1178-7fff-4ff4-6df4-bc32f087261b"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
</span><span style="border: none; display: inline-block; height: 126px; overflow: hidden; width: 139px;"><img height="126" src="https://lh3.googleusercontent.com/CLnyiBRsStUrq3xny-DBliOJ3dzMT8C1S7LVdQ8lODn5_gPXfrpdZ8Uw1rbd1v4fpoxBC_EiW14EzQhi50az_kr5nBcNfPNETNli2inGEUjHN1d14Ep30BlefCMouvq6V5qGvH1Y" style="margin-left: 0px; margin-top: 0px;" width="139" /></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">Richard is a committer on StormCrawler, CrawlerCommons and other open source projects such as Apache TomEE. He is a PhD student in the field of medical web data mining. His recent work </span><a href="https://www.jmir.org/2020/7/e17853/" style="font-family: "Times New Roman"; font-size: medium; text-decoration-line: none; white-space: normal;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">“Crawling the German Health Web”</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> was published in the Journal of Medical Internet Research and is about using </span><a href="http://stormcrawler.net/" style="font-family: "Times New Roman"; font-size: medium; text-decoration-line: none; white-space: normal;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: 
pre-wrap;">StormCrawler</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> as a focused web crawler to collect a large sample of the German Health Web.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
Richard will now tell us about his experimentation with </span><a href="http://urlfrontier.net/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">URLFrontier</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> and crawler4j. As you probably know, URLFrontier is a project sponsored by the </span><a href="https://nlnet.nl/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">NLNet foundation</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> that we, at DigitalPebble, have been working on for just over a year and it is now in its second iteration. 
Let’s start by explaining what it is all about…</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt; text-align: justify;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">What is URLFrontier?</span></h2><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Web crawlers need to store information about the URLs they process; this store is called a </span><a href="https://en.wikipedia.org/wiki/Crawl_frontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">crawl frontier</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">. Typically, each web crawler has its own way of implementing this. 
Our very own </span><a href="http://stormcrawler.net/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">StormCrawler</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;"> is no exception, except that it is not tied to one specific backend but can use several implementations like Elasticsearch, SOLR or SQL.</span></h2><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What </span><a href="https://github.com/crawler-commons/url-frontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">URLFrontier</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> does is provide a crawler/language-neutral API for the operations that web crawlers perform when communicating with a crawl frontier, e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etc. 
</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">URLFrontier is based on </span><a href="https://en.wikipedia.org/wiki/GRPC" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">gRPC</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> and provides not only an API but also an implementation of the service and client code in Java that can be used to communicate with it. Because the API and implementations are based on gRPC, URLFrontier can be used by web crawlers regardless of the programming language they are written in. 
As you would expect, StormCrawler has a </span><a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external/urlfrontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">module for URLFrontier</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, which was used extensively last year in a large-scale crawl described </span><a href="https://www.ngi.eu/blog/2022/02/10/whos-ngi-julien-nioche-with-open-source-web-crawler-url-frontier/?utm_content=buffer5318f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">here</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">.</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">By externalising the frontier logic from web crawlers, we can reuse the same implementation across different web crawlers and can make it better as a community instead of having each crawler project constantly reinventing the wheel. 
It also helps modularize a crawler setup and make it distributed.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Let’s now see what Richard has been up to. </span></h2><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">The crawler4j framework</span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Crawler4j is an open source web crawler written in Java, which provides a simple interface for crawling the Web in a single process. 
Sadly, the </span><a href="https://github.com/yasserg/crawler4j" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">original (academic) project</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> became mostly inactive, with its last release in 2018, leaving users with only two options: (1) migrate to another crawler framework or (2) maintain a fork of the library and release it to Maven Central. In the end, we decided to do the latter and forked the </span><a href="https://github.com/rzo1/crawler4j" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">repository</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> to continue using crawler4j within our academic research projects.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="background-color: white; color: #0d1117; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As setting up a multi-threaded web crawler with crawler4j is fairly simple, using a fully distributed web crawler would have been overkill for our small use-cases (i.e. 
focus on fetching single web sites)</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">. Therefore, we decided to maintain our own fork with up-to-date libraries and the possibility to (easily) switch between different frontier implementations as </span><a href="https://en.wikipedia.org/wiki/Sleepycat_License" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">Oracle’s Sleepycat licence</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> does not comply with some of our use-cases.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">To start with crawler4j, you need to choose from one of the available crawl frontier implementations:</span></p><br /><ul style="margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><a href="https://en.wikipedia.org/wiki/Berkeley_DB" style="text-decoration-line: none;"><span style="color: #1155cc; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; 
text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">Sleepycat</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> a.k.a. Berkeley DB (Key-Value-based)</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><a href="https://en.wikipedia.org/wiki/HSQLDB" style="text-decoration-line: none;"><span style="color: #1155cc; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">HSQLDB</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (SQL-based)</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><a href="https://github.com/crawler-commons/url-frontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">URLFrontier</span></a></p></li></ul><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; 
vertical-align: baseline; white-space: pre-wrap;">The HSQLDB and URLFrontier frontier implementations are only available in our fork. They aim to mitigate the rather strict licensing policies of Sleepycat.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">After choosing a crawl frontier implementation, you can simply add the required dependency via Maven to your project (here: we choose URLFrontier):</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> <dependency></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> <groupId>de.hs-heilbronn.mi</groupId></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> <artifactId>crawler4j-with-urlfrontier</artifactId></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> <version>4.8.2</version></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 
10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> <type>pom</type></span></p><p dir="ltr" style="line-height: 1.74; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #0d1117; font-family: "Courier New"; font-size: 10pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </dependency></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Next, you have to create a crawler class which extends WebCrawler. This class decides which URLs should be crawled and handles the fetched web pages. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">public class </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">FrontierWebCrawler </span><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">extends </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WebCrawler {</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: 
baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #bbb529; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">@Override</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #bbb529; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">public boolean </span><span face="Consolas, sans-serif" style="color: #ffc66d; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">shouldVisit</span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">(Page referringPage</span><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WebURL url) {</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; 
font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> // determines whether a given URL should be visited by the crawler</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">return </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">true;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">}</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #bbb529; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">@Override</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #bbb529; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; 
vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">public void </span><span face="Consolas, sans-serif" style="color: #ffc66d; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">visit</span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">(Page page) {</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #9876aa; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">//handle a fetched page, e.g. 
store it</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #cc7832; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">}</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span face="Consolas, sans-serif" style="color: #a9b7c6; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">}</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In addition, you need to implement a controller class which specifies the seeds for the web crawl, the folder in which crawler4j will store intermediate crawl data and some other config options such as the number of crawler threads or if the web crawler should be polite and/or honour the robots exclusion protocol. 
This can be done like this:</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">protected </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">CrawlController </span><span style="color: #ffc66d; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">init</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">() </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">throws </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Exception {</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: 
pre-wrap;">CrawlConfig config = </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">CrawlConfig()</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setCrawlStorageFolder(</span><span style="color: #9876aa; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">"/tmp"</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> 
</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setPolitenessDelay(</span><span style="color: #6897bb; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">800</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setMaxDepthOfCrawling(</span><span style="color: #6897bb; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">3</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">; </span><span style="color: grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: 
normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setIncludeBinaryContentInCrawling(</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">false</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setResumableCrawling(</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">true</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; 
font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">config.setHaltOnError(</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">false</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">BasicURLNormalizer normalizer = BasicURLNormalizer.</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-style: italic; 
font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">newBuilder</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">().idnNormalization(BasicURLNormalizer.IdnNormalization.</span><span style="color: #9876aa; font-family: "Courier New"; font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">NONE</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">).build()</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">PageFetcher pageFetcher = </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">PageFetcher(config</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">normalizer)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">RobotstxtConfig robotstxtConfig = </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">RobotstxtConfig()</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">robotstxtConfig.setSkipCheckForSeeds(</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">true</span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">; </span><span style="color: grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">// we skip the robots checks for adding seeds (will be checked later on demand)</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">int maxQueues = 10;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: 
grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">int port = 10;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: grey; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">FrontierConfiguration frontierConfiguration = </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">URLFrontierConfiguration(config</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">maxQueues</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #6a8759; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">"localhost"</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">port)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> final </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">RobotstxtServer robotstxtServer = </span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; 
white-space: pre-wrap;">RobotstxtServer(robotstxtConfig</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">pageFetcher</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">frontierConfiguration.getWebURLFactory())</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> return new </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">CrawlController(config</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">normalizer</span><span 
style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">pageFetcher</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">robotstxtServer</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">frontierConfiguration)</span><span style="color: #cc7832; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">;</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="color: #a9b7c6; font-family: "Courier New"; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">}</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Seeds can then be added via the CrawlController. 
To speed up seeding, you can skip the robots.txt check while adding new seeds; robots.txt is then checked later, on demand.</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Crawler4j in </span><span face="Roboto, sans-serif" style="color: #373637; font-size: 26pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">♥ </span><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">with URLFrontier</span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The integration of URLFrontier in crawler4j basically boils down to three (adapter) classes and some boilerplate code to connect with the gRPC code provided by URLFrontier. This significantly reduces the amount of crawler logic needed to handle the crawl frontier. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As URLFrontier handles duplicate URLs and acts as a remote crawl frontier, it is now fairly simple to run crawler4j on different machines. URLFrontier then acts as the single point of synchronisation. Consequently, this approach can turn crawler4j into a simple distributed web crawler. 
Without a remote frontier (like URLFrontier), we would have had to build a custom distributed frontier ourselves, for example with a framework like Hazelcast, in order to distribute crawler4j’s crawl frontier. In both cases, distributing crawler4j comes at the cost of implementing additional business logic to handle or store the fetched Web pages in a distributed way. Nevertheless, the ease of implementing a web crawler with crawler4j outweighs this issue.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The default </span><a href="https://github.com/crawler-commons/url-frontier/tree/master/service" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">URLFrontier service implementation</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> is based on RocksDB and it is publicly available as a </span><a href="https://hub.docker.com/r/crawlercommons/url-frontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">Docker image</span></a><span 
style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">.</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Experimenting with different frontier implementations</span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">For our experiment, we relied on three virtual machines (VMs). Each VM has 4 vCPUs and 10 GB of memory, and runs Ubuntu 20.04 LTS with the latest OpenJDK 17. We used a </span><span style="background-color: white; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">seed list of 1M URLs generated from the site rankings computed by</span><span style="background-color: white; color: #3c4043; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><a href="https://commoncrawl.org/" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">CommonCrawl</span></a><span style="background-color: white; color: #3c4043; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">.</span></p><br 
/><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Each Web crawl was started simultaneously on each VM and was run for an exact duration of 48 hours. We limited the crawling depth per URL to 3. URLFrontier was run as a Docker container residing on the same VM as the crawler. Every 30 seconds, we checked the number of processed (i.e. completed) URLs. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Note that we did not apply any further processing of fetched Web pages as this wasn’t in the scope of our experiment. The example’s code is available on </span><a href="https://github.com/rzo1/crawler4j-frontier-battle" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">GitHub</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">.</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Results</span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: 
normal; vertical-align: baseline; white-space: pre-wrap;">Depending on the frontier implementation, crawler4j downloaded between 68 and 90 web pages per minute on average, with a politeness delay of 800 ms between requests to the same host. The detailed statistics are:</span></p><br /><ul style="margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Sleepycat: fetched ~ 90 pages / min; </span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">URLFrontier: fetched ~ 72 pages / min; </span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">HSQLDB: fetched ~ 68 pages / min; </span></p></li></ul><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: 
normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Figure 1 depicts the number of processed (i.e. fetched) URLs over the 48-hour crawl period. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 411px; overflow: hidden; width: 602px;"><img height="411" src="https://lh5.googleusercontent.com/oT1D7-MdkukM23GcvhwDKrgXFqbxiELEA1uavTMbTkSS6i0Y7Va9ZDoMiryZdktrKmiPqBD9JDZgk4bmWiy3l8tNvDp6JZ1jqubY5w8kvCogqhpntKcey4dw4rU5cIp8YjCcYKZ9" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Overall, there is a noticeable gap between Sleepycat and the URLFrontier and HSQLDB frontier implementations. HSQLDB is only about four pages per minute slower than URLFrontier. As the aggregate numbers show, Sleepycat is faster than both other implementations. 
We can assume that the proprietary Sleepycat communication protocol outperforms gRPC (URLFrontier) and JDBC (HSQLDB) calls because it adds less communication overhead.</span></p><br /><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Performance aside, one benefit of using URLFrontier is that the code needed to integrate it in crawler4j boils down to only three classes, while the other two implementations required a significantly more complex integration. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In addition, by adopting URLFrontier as a backend, it is possible to easily exchange the crawler implementation and re-use the same data as before. We also benefit from any improvements to the service implementation without the need to change a single line of code. 
In particular, the </span><a href="https://github.com/crawler-commons/url-frontier/wiki/Roadmap-URLFrontier-2" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">forthcoming versions of URLFrontier</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> should contain some very useful features.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Another important advantage of URLFrontier is that it opens the content of the frontier to the outside world: You can manipulate or view the content of the frontier during an ongoing web crawl via the CLI. This is not possible for the other frontier implementations.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Overall, our experiment showed that URLFrontier is slower than the original Sleepycat implementation, which most likely originates from the overhead introduced by the gRPC calls to communicate with URLFrontier. This is also true for the JDBC-based HSQLDB implementation. 
On the plus side, URLFrontier does not suffer from (commercial) licensing issues as Sleepycat does and can turn crawler4j into a simple distributed web crawler with little additional work, unlike the other two implementations. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The figures given in this post depend on the particular seed list, the ordering of URLs, and the hardware used for the experiment. Therefore, you might get different results for your specific use case. Since the resources and configurations of this experiment are publicly available, you can try to reproduce it and extend it as you wish. </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Next steps</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This experiment has been very successful and informative and we are hoping to run more benchmarks in the future, for instance a larger-scale crawl in fully distributed mode.</span><span style="font-family: Arial; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">URLFrontier is getting many improvements in its current phase of development and we are beginning to see alternative implementations of the service, like </span><a href="https://github.com/PresearchOfficial/opensearch-frontier" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">this one based on Opensearch</span></a><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">. We are also seeing the project gain some traction with existing web crawlers. </span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">An alternative experiment would be to compare the performance of the different URLFrontier service implementations available. 
Exciting times ahead!</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Happy crawling everyone and a massive thank you to Richard for being our guest writer. </span></p><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div></span>Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-5891323919048551662022-01-11T17:28:00.004+00:002022-01-11T17:29:16.827+00:00What's new in StormCrawler 2.2<h2><span style="font-family: arial; font-size: small; white-space: pre-wrap;"><p style="font-weight: 400; text-align: justify; white-space: normal;"><span><span style="white-space: pre-wrap;"><a href="https://stormcrawler.net" target="_blank">StormCrawler</a> 2.2 has just been released. This marks the beginning of having releases only for 2.x, 1.18 was the last release for the 1.x branch which is now discontinued. In case you were wondering why there was no "<i>What's new in StormCrawler 2.1</i>", it is simply that it contained the same modifications as 1.18 and did not get its own announcement.
</span></span></p><p style="font-weight: 400; white-space: normal;"><span style="white-space: pre-wrap;">This version contains many bugfixes; as usual, users are advised to upgrade.</span></p><p style="font-weight: 400; white-space: normal;"><span style="white-space: pre-wrap;">Happy crawling and thanks to our </span><a href="https://github.com/sponsors/DigitalPebble" style="white-space: pre-wrap;" target="_blank">sponsors</a><span style="white-space: pre-wrap;">, contributors and users!
PS: I am tempted to run a workshop on web crawling with StormCrawler at the BigData conference in Vilnius in November. Anyone interested? If so, please get in touch and let me know what you'd like to learn about. </span><a href="https://t.co/YDNAjUM9KB" rel="noopener noreferrer" target="_blank">https://bigdataconference.eu/</a><br /></p><a name='more'></a>
Dependency upgrades </span></h2><div><span style="font-family: arial;"><span style="white-space: pre-wrap;"><span>See individual upgrades </span>in </span><span style="white-space: pre;"><a href="https://github.com/DigitalPebble/storm-crawler/issues/914" target="_blank">#914</a></span></span></div><div><span style="font-family: arial; white-space: pre-wrap;"><br /></span></div><ul style="margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: arial;"><span style="white-space: normal;">Storm 2.3.0 <a href="https://github.com/DigitalPebble/storm-crawler/issues/911" target="_blank">#911</a></span></span></p></li><li aria-level="1" dir="ltr" style="font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: arial;"><span style="white-space: normal;">Log4j 2.17.0 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/936" style="white-space: normal;">#936</a><br /><br /></span></p></li></ul><span style="font-family: arial;"><div style="text-align: justify;"><span style="white-space: pre-wrap;">As of writing, Apache Storm has not released a version containing a fix for the </span><span style="white-space: pre-wrap;">Log4J vulnerability - CVE-2021-44228 (see <a href="https://github.com/apache/storm/pull/3427#issuecomment-1006010073" target="_blank">discussion</a>). It is however possible to patch a running version of Storm <a href="https://github.com/DigitalPebble/storm-crawler/issues/935#issuecomment-992962000" target="_blank">as explained by Sebastian</a>.</span></div></span>
<br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: arial; font-size: small; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Core</span></h2><p></p><ul style="text-align: left;"><li><span style="font-family: arial;">StackOverFlow issue in CharsetIdentification <a href="https://github.com/DigitalPebble/storm-crawler/issues/895" target="_blank">#895</a></span></li><li><span style="font-family: arial;">OkHttp protocol: make connection pool configurable <a href="https://github.com/DigitalPebble/storm-crawler/issues/918">#918</a></span></li><li><span style="font-family: arial;">Remove selenium.instances.num <a href="https://github.com/DigitalPebble/storm-crawler/issues/933">#933</a></span></li><li><span style="font-family: arial;">Changed ProtocolFactory to be a singleton <a href="https://github.com/DigitalPebble/storm-crawler/issues/932">#932</a></span></li><li><span style="font-family: arial;">Need to register Status class with Kryo <a href="https://github.com/DigitalPebble/storm-crawler/issues/924">#924</a></span></li><li><span style="font-family: arial;">JSoupParserBolt cannot configure more than one JSoupFilters per worker <a href="https://github.com/DigitalPebble/storm-crawler/issues/925">#925</a></span></li><li><span style="font-family: arial;">Remove static keyword on JSoupFilters field <a href="https://github.com/DigitalPebble/storm-crawler/issues/927">#927</a></span></li><li><span style="font-family: arial;">Support HEAD method in okhttp protocol <a href="https://github.com/DigitalPebble/storm-crawler/issues/923">#923</a></span></li><li><span style="font-family: arial;">Allow to set http.content.limit per page in metadata <a href="https://github.com/DigitalPebble/storm-crawler/pull/922" target="_blank">#922</a></span></li><li><span style="font-family: arial;">OkHttp protocol: add support for Brotli compression (Content-Encoding) <a 
href="https://github.com/DigitalPebble/storm-crawler/pull/919" target="_blank">#919</a></span></li><li><span style="font-family: arial;">Protocols: Integer.MAX_VALUE not save as max. content size <a href="https://github.com/DigitalPebble/storm-crawler/pull/854" target="_blank">#854</a></span></li><li><span style="font-family: arial;">Protocols: adding support for custom headers <a href="https://github.com/DigitalPebble/storm-crawler/pull/912" target="_blank">#912</a></span></li><li><span style="font-family: arial;">Replace Guava caches with Caffeine <a href="https://github.com/DigitalPebble/storm-crawler/pull/903" target="_blank">#903</a> and <a href="https://github.com/DigitalPebble/storm-crawler/pull/905" target="_blank">#905</a></span></li><li><span style="font-family: arial;">DelegatorProtocol <a href="https://github.com/DigitalPebble/storm-crawler/pull/900" target="_blank">#900</a> </span></li><li><span style="font-family: arial;">Fixed bug with StackOverflowError in fast charset identification <a href="https://github.com/DigitalPebble/storm-crawler/pull/895" target="_blank">#895 </a></span></li><li><span style="font-family: arial;">Multi proxy support <a href="https://github.com/DigitalPebble/storm-crawler/pull/890" target="_blank">#890</a></span></li></ul><p></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: arial; font-size: small; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Elasticsearch</span></h2><div><ul style="text-align: left;"><li><span style="font-family: arial;">ES Spout to connect to local shards when available <a href="https://github.com/DigitalPebble/storm-crawler/pull/852" target="_blank">#852</a></span></li><li><span style="font-family: arial;">Issue with ConcurrentModificationException for Metadata in StatusMetricsBolt <a href="https://github.com/DigitalPebble/storm-crawler/pull/909" 
target="_blank">#909</a></span></li></ul></div><p><span style="font-family: arial;"><br /></span></p><p><span style="font-family: arial;"><br /></span></p><p><br /></p>Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-44240017952369807592021-05-05T11:32:00.003+01:002021-05-05T11:32:35.793+01:00What's new in StormCrawler 1.18<p> <br /><span style="font-family: Arial;"><span style="font-size: 14.6667px; white-space: pre-wrap;"><a href="https://stormcrawler.net" target="_blank">StormCrawler</a> 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).
</span></span></p><p><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">This version contains many bugfixes; as usual, users are advised to upgrade. One of the noticeable new features is a module for <a href="https://github.com/crawler-commons/url-frontier" target="_blank">URLFrontier</a> (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.</span></p><p><span style="font-family: Arial;"><span style="font-size: 14.6667px; white-space: pre-wrap;">1.18 is also likely to be the last release based on Apache Storm 1.x; our <a href="https://twitter.com/digitalpebble/status/1387071542455046150">2.x branch will become master</a> as soon as I have released 2.1.</span></span></p><p><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">Happy crawling and thanks to our <a href="https://github.com/sponsors/DigitalPebble" target="_blank">sponsors</a>, contributors and users!</span></p><p><span style="font-family: Arial; font-size: 11pt; white-space: pre-wrap;"><span></span></span></p><a name='more'></a><h2 style="text-align: left;"><span style="font-family: Arial; font-size: 11pt; white-space: pre-wrap;">Dependency upgrades</span></h2><p></p><span id="docs-internal-guid-1e1c3306-7fff-3612-1067-eeefbce57196"><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Tika 1.26 <a href="https://github.com/DigitalPebble/storm-crawler/issues/869">#869</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; 
font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Icu4j.version 68.2 <a href="https://github.com/DigitalPebble/storm-crawler/issues/855">#855</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Httpclient 4.5.13 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/855" style="font-size: 11pt; white-space: pre-wrap;">#855</a></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Rometools 1.15.0 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/855" style="font-size: 11pt; white-space: pre-wrap;">#855</a></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 
0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">okhttp 4.9.1 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/855" style="font-size: 11pt; white-space: pre-wrap;">#855</a></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">SOLR 8.8.0 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/855" style="font-size: 11pt; white-space: pre-wrap;">#855</a></p></li></ul><br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Core</span></h2><br /><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">FileSpout doesn't replay failed tuples? 
<a href="https://github.com/DigitalPebble/storm-crawler/issues/816">#816</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Simplify indexer config when the metadata key is the same as the field <a href="https://github.com/DigitalPebble/storm-crawler/issues/819">#819 </a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">HttpHeaders#formatDate fails to parse date and returns always an empty string <a href="https://github.com/DigitalPebble/storm-crawler/issues/821">#821</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">HTTP date formatter to follow RFC 7231 <a href="https://github.com/DigitalPebble/storm-crawler/issues/820">#820</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; 
font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">HTTP protocol implementation: allow to configure which protocol version(s) to use <a href="https://github.com/DigitalPebble/storm-crawler/issues/827" target="_blank">#827 </a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] <a href="https://github.com/DigitalPebble/storm-crawler/issues/826" target="_blank">#826</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Fetcher logQueuesContent won't be called if no new tuples are getting in <a href="https://github.com/DigitalPebble/storm-crawler/issues/838" target="_blank">#838</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p 
dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Set user-agent as a one liner <a href="https://github.com/DigitalPebble/storm-crawler/issues/846" target="_blank">#846</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Add option to completely skip text extraction <a href="https://github.com/DigitalPebble/storm-crawler/issues/848" target="_blank">#848</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Provide option for a faster charset detection strategy <a href="https://github.com/DigitalPebble/storm-crawler/issues/849">#849</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: 
pre-wrap;"><span style="color: red;"><b>BREAKING CHANGE</b> </span> Scheduler implementations return an Optional&lt;Date&gt; <a href="https://github.com/DigitalPebble/storm-crawler/issues/866">#866</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Jsoupfilters <a href="https://github.com/DigitalPebble/storm-crawler/issues/877" target="_blank">#877</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Add JSoup specific parse filters enhancement parser <a href="https://github.com/DigitalPebble/storm-crawler/issues/847">#847</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Need a more reliable detection of whether a document has been already parsed by Jsoup <a href="https://github.com/DigitalPebble/storm-crawler/issues/875" 
target="_blank">#875</a></span></p></li></ul><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Default setting for 'selenium.pageLoadTimeout' leads to 'InvalidArgumentException' when using Selenium </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/882" style="text-decoration-line: none;"><span style="color: #1155cc; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">#882</span></a></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Track time spent in DNS resolution by OKHTTP </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/878" style="text-decoration-line: none;"><span style="color: #1155cc; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">#878</span></a></p></li></ul><br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: 
Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Archetypes</span></h2><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Archetypes use okhttp protocol <a href="https://github.com/DigitalPebble/storm-crawler/issues/845" target="_blank">#845</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Archetypes generate topologies with Tika parsing <a href="https://github.com/DigitalPebble/storm-crawler/issues/858" target="_blank">#858</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; 
vertical-align: baseline; white-space: pre-wrap;">Add MimeTypeNormalization parse filter to topologies generated from archetypes <a href="https://github.com/DigitalPebble/storm-crawler/issues/860">#860</a></span></p></li></ul><br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Elasticsearch</span></h2><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Can't skip text or url fields in indexing <a href="https://github.com/DigitalPebble/storm-crawler/issues/818">#818</a> </span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Elasticsearch IndexerBolt: tuples with canonical URL may not get acked <a href="https://github.com/DigitalPebble/storm-crawler/issues/832">#832</a></span></p></li><li aria-level="1" dir="ltr" 
style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Add JUnit tests for ES module <a href="https://github.com/DigitalPebble/storm-crawler/issues/834" target="_blank">#834</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">JUnit tests for ES + tuples with canonical URL may not get acked <a href="https://github.com/DigitalPebble/storm-crawler/issues/836" target="_blank">#836</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">StatusUpdaterBolt should use timeField() to index nextFetchDate? 
<a href="https://github.com/DigitalPebble/storm-crawler/issues/824" target="_blank">#824</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Add Deletion bolt to Flux version of the Elasticsearch topo from the archetype <a href="https://github.com/DigitalPebble/storm-crawler/issues/859" target="_blank">#859</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Do not generate a nextFetchDate at all if the scheduling is set to NEVER <a href="https://github.com/DigitalPebble/storm-crawler/issues/861" target="_blank">#861</a></span></p></li></ul><br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARC</span></h2><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: 
normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARCSpout: add metadata field "fetch.statusCode" (HTTP status code) <a href="https://github.com/DigitalPebble/storm-crawler/issues/823" target="_blank">#823</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARCSpout/FileSpout: ClassCastException if message ID of failed tuples is not of type byte[] <a href="https://github.com/DigitalPebble/storm-crawler/issues/826">#826</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARCSpout to add _request.time_ to metadata <a href="https://github.com/DigitalPebble/storm-crawler/issues/831">#831</a> </span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; 
margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARCSpout doesn't handle http.content.limit -1 correctly <a href="https://github.com/DigitalPebble/storm-crawler/issues/850" target="_blank">#850</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">WARCSpout: IllegalArgumentException if http.content.limit == -1 <a href="https://github.com/DigitalPebble/storm-crawler/issues/833">#833</a></span></p></li></ul><br /><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Urlfrontier</span></h2><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><ul style="margin-bottom: 0; margin-top: 0; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Add URLFrontier module external <a 
href="https://github.com/DigitalPebble/storm-crawler/issues/865" target="_blank">#865</a> <a href="https://github.com/DigitalPebble/storm-crawler/issues/868">#868</a></span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Spout to stream incoming results instead of using a blocking call <a href="https://github.com/DigitalPebble/storm-crawler/issues/879">#879</a></span></p></li></ul></span>
<p>Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a></p>
<h1>Please welcome StormCrawler 2.0</h1>
<p>Published 2020-07-20T16:30:00.000+01:00 (updated 2020-07-20T16:30:48.595+01:00)</p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Nearly 6 years after its initial release and after another 32 releases, StormCrawler has just reached version 2.0! </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This is similar to what we did 4 years ago when 1.0 was released, in that the change of major version reflects the version of Apache Storm that StormCrawler is based on. This is not a major refactoring of StormCrawler in any way, although some minor changes can be found, mainly in the way the topologies are submitted. These changes are documented in the READMEs generated by our archetypes.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In terms of functionality and behavior, StormCrawler 2.0 is similar to version 1.17, released a few minutes ago.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I expect to keep both branches in parallel for a bit, at least until StormCrawler 2.0 has been sufficiently tested and is used by the majority of our users.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The change to Apache Storm 2 is not just a way of future-proofing StormCrawler, since version 2 is the current branch in Apache Storm. By adopting Storm 2, we also get a platform written 100% in Java, which makes debugging and possible contributions to Apache Storm itself easier, and we benefit from Storm's recent improvements such as improved performance and a better backpressure model.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I am looking forward to getting feedback (and bugfixes) from the StormCrawler community. Please give StormCrawler 2.0 a try if you can.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Happy crawling! </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<br />
<br /></div>
<p>Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a></p>
<h1>What's new in StormCrawler 1.17</h1>
<p>Published 2020-07-20T16:03:00.001+01:00 (updated 2020-07-20T16:03:36.383+01:00)</p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
I have just released <a href="https://github.com/DigitalPebble/storm-crawler/milestone/28?closed=1" target="_blank">StormCrawler 1.17</a>. As you can see in the list below, this contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version. However, please check the breaking changes below before applying it to an existing crawl.<br />
<h3 style="text-align: left;">
Dependency upgrades</h3>
<div>
<ul style="text-align: left;">
<li>Various dependency upgrades <a href="https://github.com/DigitalPebble/storm-crawler/issues/808" target="_blank">#808</a></li>
<li>CrawlerCommons 1.1 dependency <a href="https://github.com/DigitalPebble/storm-crawler/issues/807" target="_blank">#807</a></li>
<li>Tika 1.24.1 <a href="https://github.com/DigitalPebble/storm-crawler/issues/797" target="_blank">#797</a></li>
<li>Jackson-databind #803 #793 #798</li>
</ul>
</div>
<h3 style="text-align: left;">
Core</h3>
<div>
<ul style="text-align: left;">
<li>Use regular expressions for custom number of threads per queue fetcher <a href="https://github.com/DigitalPebble/storm-crawler/issues/788" target="_blank">#788</a></li>
<li><span style="color: red;">/!breaking!/</span> Prefix protocol metadata <a href="https://github.com/DigitalPebble/storm-crawler/pull/789" target="_blank">#789</a></li>
<li>Basic authentication for OKHTTP <a href="https://github.com/DigitalPebble/storm-crawler/issues/792" target="_blank">#792</a></li>
<li>Utility to debug / test parsefilters <a href="https://github.com/DigitalPebble/storm-crawler/issues/794" target="_blank">#794</a></li>
<li><span style="color: red;">/!breaking!/</span> Remove deprecated methods and fields <a href="https://github.com/DigitalPebble/storm-crawler/issues/791" target="_blank">#791</a></li>
<li>AdaptiveScheduler to set last-modified time in metadata <a href="https://github.com/DigitalPebble/storm-crawler/issues/777" target="_blank">#777</a> <a href="https://github.com/DigitalPebble/storm-crawler/issues/812" target="_blank">#812</a></li>
<li><span style="color: lime;">/bugfix/ </span>_fetch.exception_ key should be removed from metadata if subsequent fetches are successful <a href="https://github.com/DigitalPebble/storm-crawler/issues/813" target="_blank">#813</a></li>
<li><span style="color: lime;">/bugfix/</span><span style="background-color: #fdfbf5; color: lime; font-family: arial, helvetica, sans-serif; font-size: 13.3333px; vertical-align: baseline; white-space: pre-wrap;"> </span>SimpleFetcherBolt maxThrottleSleepMSec not deactivated <a href="https://github.com/DigitalPebble/storm-crawler/issues/814" target="_blank">#814</a></li>
<li><span style="color: red;">/!breaking!/</span> Index pages with content="noindex,follow" meta tag <a href="https://github.com/DigitalPebble/storm-crawler/issues/750" target="_blank">#750</a></li>
<li>Enable extension parsing for SitemapParser <a href="https://github.com/DigitalPebble/storm-crawler/issues/749" target="_blank">#749</a> <a href="https://github.com/DigitalPebble/storm-crawler/pull/815" target="_blank">#815</a></li>
</ul>
<div>
<h3>
WARC</h3>
</div>
</div>
<br />
<ul style="text-align: left;">
<li>Implement WARC spout <a href="https://github.com/DigitalPebble/storm-crawler/issues/755" target="_blank">#755</a> <a href="https://github.com/DigitalPebble/storm-crawler/issues/799" target="_blank">#799</a></li>
</ul>
<br />
<h3 style="text-align: left;">
Elasticsearch</h3>
<br />
<ul style="text-align: left;">
<li><span style="background-color: #fdfbf5; color: lime; font-family: arial, helvetica, sans-serif; font-size: 13.3333px; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="background-color: #fdfbf5; color: lime; font-family: arial, helvetica, sans-serif; font-size: 13.3333px; vertical-align: baseline; white-space: pre-wrap;"> </span>AggregationSpout error due to SimpleDateFormat not being thread-safe <a href="https://github.com/DigitalPebble/storm-crawler/issues/809" target="_blank">#809</a></li>
<li><span style="background-color: #fdfbf5; color: lime; font-family: arial, helvetica, sans-serif; font-size: 13.3333px; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="background-color: #fdfbf5; color: lime; font-family: arial, helvetica, sans-serif; font-size: 13.3333px; vertical-align: baseline; white-space: pre-wrap;"> </span>IndexerBolt issue causing ack failures <a href="https://github.com/DigitalPebble/storm-crawler/issues/801" target="_blank">#801</a></li>
<li>Allow ES to connect over a proxy <a href="https://github.com/DigitalPebble/storm-crawler/issues/787" target="_blank">#787</a></li>
</ul>
<div>
Of the breaking changes above, <a href="https://github.com/DigitalPebble/storm-crawler/pull/789" target="_blank">#789</a> is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add </div>
<div>
<br /></div>
<div>
<span style="background-color: white; color: #032f62; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, monospace; font-size: 12px; white-space: pre;">protocol.md.prefix: ""</span></div>
<div>
<br /></div>
<div>
to the configuration. Similarly, <i>http.skip.robots</i> has been renamed to <i>http.robots.file.skip</i>.</div>
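<div>
The two configuration changes above can be sketched in Java as a small helper that adapts an existing configuration map. This is only an illustration: the class <i>ConfigMigration117</i> and its <i>migrate</i> method are hypothetical and not part of StormCrawler, and in practice you would most likely just edit the YAML configuration file by hand. Only the three key names come from the notes above.</div>

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, NOT part of StormCrawler: adapts a pre-1.17 configuration map. */
class ConfigMigration117 {

    /** Returns a copy of the given configuration with the 1.17 changes applied. */
    static Map<String, Object> migrate(Map<String, Object> oldConf) {
        Map<String, Object> conf = new HashMap<>(oldConf);

        // #789: an existing crawl expects unprefixed protocol metadata,
        // so set an empty prefix unless one is already configured.
        conf.putIfAbsent("protocol.md.prefix", "");

        // http.skip.robots was renamed to http.robots.file.skip.
        if (conf.containsKey("http.skip.robots")) {
            Object value = conf.remove("http.skip.robots");
            conf.putIfAbsent("http.robots.file.skip", value);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> old = new HashMap<>();
        old.put("http.skip.robots", true);
        // Prints the migrated map; key order is not guaranteed for a HashMap.
        System.out.println(migrate(old));
    }
}
```

<div>
The helper copies the map rather than modifying it in place, so the original configuration stays available for comparison.</div>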
<div>
<i><br /></i></div>
<div>
<i><br /></i></div>
<div>
Thanks to all contributors and users! Happy crawling! </div>
<div>
<br /></div>
<div>
PS: something equally exciting is coming next ;-)</div>
<div>
<i><br /></i></div>
<div>
<i><br /></i></div>
<div>
<i><br /></i></div>
</div>
<p>Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a></p>
<h1>What's new in StormCrawler 1.16?</h1>
<p>Published 2020-01-16T12:17:00.004+00:00 (updated 2020-01-16T14:48:08.365+00:00)</p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Happy new year!</span><br />
<br />
<span style="color: black; font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline; white-space: pre;"><a href="http://stormcrawler.net/" target="_blank">StormCrawler</a> 1.16 was released a couple of days ago. </span><span style="color: black; font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline; white-space: pre;">You can find the full list of changes on </span><a href="https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1" style="font-family: arial, helvetica, sans-serif;">https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1</a></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre-wrap;">As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.</span><br />
<h2>
<span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre-wrap;">Dependency upgrades</span></h2>
</div>
</div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Tika 1.23 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/771" target="_blank">#771</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES 7.5.0 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/770" target="_blank">#770</a>) </span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">jackson-databind from 2.9.9.2 to 2.9.10.1 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/767" target="_blank">#767</a>)</span></li>
</ul>
<h2 style="text-align: left;">
Core</h2>
<div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">OKHttp configure authentication for proxies (<a href="https://github.com/DigitalPebble/storm-crawler/issues/751" target="_blank">#751</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (<a href="https://github.com/DigitalPebble/storm-crawler/issues/754" target="_blank">#754</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: lime;"><span style="vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="vertical-align: baseline; white-space: pre-wrap;"> </span></span>okhttp protocol: reliably mark trimmed content because of content limit (<a href="https://github.com/DigitalPebble/storm-crawler/issues/757" target="_blank">#757</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: red;">/!breaking!/ </span>urlbuffer code in a separate package + 2 new implementations (<a href="https://github.com/DigitalPebble/storm-crawler/issues/764" target="_blank">#764</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Crawl-delay handling: allow `fetcher.max.crawl.delay` to exceed 300 sec. (<a href="https://github.com/DigitalPebble/storm-crawler/issues/768" target="_blank">#768</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">okhttp protocol: HTTP request header lacks protocol name and version (<a href="https://github.com/DigitalPebble/storm-crawler/issues/775" target="_blank">#775</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Locking mechanism for Metadata objects (<a href="https://github.com/DigitalPebble/storm-crawler/issues/781" target="_blank">#781</a>)</span></li>
</ul>
</div>
<div>
<h2 style="text-align: left;">
LangID</h2>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: lime;"><span style="vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="vertical-align: baseline; white-space: pre-wrap;"> </span></span>langID parse filter gets stuck (<a href="https://github.com/DigitalPebble/storm-crawler/issues/758" target="_blank">#758</a>)</span></li>
</ul>
</div>
<div>
<h2 style="text-align: left;">
Elasticsearch</h2>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: lime; white-space: pre-wrap;">/bugfix/</span><span style="color: red; white-space: pre-wrap;"> </span>Fix NullPointerException in JSONResourceWrappers (<a href="https://github.com/DigitalPebble/storm-crawler/issues/760" target="_blank">#760</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES specify field used for grouping the URLs explicitly in mapping (<a href="https://github.com/DigitalPebble/storm-crawler/issues/761" target="_blank">#761</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Use search after for pagination in HybridSpout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/762" target="_blank">#762</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Filter queries in ES can be defined as lists (<a href="https://github.com/DigitalPebble/storm-crawler/issues/765" target="_blank">#765</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><i>es.status.bucket.sort.field </i>can take a list of values (<a href="https://github.com/DigitalPebble/storm-crawler/issues/766" target="_blank">#766</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Archetype for SC+Elasticsearch (<a href="https://github.com/DigitalPebble/storm-crawler/issues/773" target="_blank">#773</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES merge seed injection into crawl topology (<a href="https://github.com/DigitalPebble/storm-crawler/issues/778" target="_blank">#778</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Kibana - change format of templates to ndjson (<a href="https://github.com/DigitalPebble/storm-crawler/issues/780" target="_blank">#780</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: lime;"><span style="vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="vertical-align: baseline; white-space: pre-wrap;"> </span></span>HybridSpout get key for results when prefixed by "metadata." (<a href="https://github.com/DigitalPebble/storm-crawler/issues/782" target="_blank">#782</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">AggregationSpout to store <i>sortValues</i> for the last result of each bucket (<a href="https://github.com/DigitalPebble/storm-crawler/issues/783" target="_blank">#783</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Import Kibana dashboards using the API (<a href="https://github.com/DigitalPebble/storm-crawler/issues/785" target="_blank">#785</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Include Kibana script and resources in ES archetype (<a href="https://github.com/DigitalPebble/storm-crawler/issues/786" target="_blank">#786</a>)</span></li>
</ul>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
</div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend </span><span style="font-family: "arial" , "helvetica" , sans-serif;">(</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/773" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#773</a><span style="font-family: "arial" , "helvetica" , sans-serif;">)</span><span style="font-family: "arial" , "helvetica" , sans-serif;">. This is done by calling</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="background-color: rgba(27 , 31 , 35 , 0.05); color: #24292e; font-family: , "consolas" , "liberation mono" , "menlo" , monospace; font-size: 13.6px;">mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The generated project also contains a script and resources to load templates into Kibana.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The <a href="https://digitalpebble.blogspot.com/2019/09/stormcrawler-1.html" target="_blank">previous release</a> included URLBuffers, with just one simple implementation. Two new implementations have been added in <a href="https://github.com/DigitalPebble/storm-crawler/issues/764" target="_blank">#764</a>. The brand new <i>PriorityURLBuffer</i> sorts the buckets by the number of acks they got since the last sort whereas the <i>SchedulingURLBuffer</i> tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.</span></div>
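<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">To make the idea behind the <i>PriorityURLBuffer</i> more concrete, here is a toy sketch of "sort the buckets by the number of acks since the last sort". It is not the StormCrawler implementation: the class <i>AckPriorityBuckets</i> and its methods are made up for this example, and both the ordering (most acked first) and the tie-breaking by key are assumptions.</span></div>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only; this is NOT StormCrawler's PriorityURLBuffer. */
class AckPriorityBuckets {

    // Number of acks received per bucket (e.g. per host) since the last sort.
    private final Map<String, Integer> acksSinceLastSort = new HashMap<>();

    void addBucket(String key) {
        acksSinceLastSort.putIfAbsent(key, 0);
    }

    /** Records that a URL from the given bucket has been acked. */
    void acked(String key) {
        acksSinceLastSort.merge(key, 1, Integer::sum);
    }

    /**
     * Returns the bucket keys ordered by ack count, most acked first
     * (assumed direction), ties broken alphabetically, then resets the counters.
     */
    List<String> sortAndReset() {
        List<String> keys = new ArrayList<>(acksSinceLastSort.keySet());
        keys.sort((a, b) -> {
            int byAcks = Integer.compare(acksSinceLastSort.get(b), acksSinceLastSort.get(a));
            return byAcks != 0 ? byAcks : a.compareTo(b);
        });
        acksSinceLastSort.replaceAll((k, v) -> 0);
        return keys;
    }

    public static void main(String[] args) {
        AckPriorityBuckets buckets = new AckPriorityBuckets();
        buckets.addBucket("a.example");
        buckets.addBucket("b.example");
        buckets.acked("b.example");
        System.out.println(buckets.sortAndReset());
    }
}
```

<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Resetting the counters after each sort means a bucket's position only reflects its recent activity, which matches the "since the last sort" wording above.</span></div>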
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Finally, we added a soft locking mechanism to Metadata (<a href="https://github.com/DigitalPebble/storm-crawler/issues/781" target="_blank">#781</a>) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling <i>metadata.lock()</i> when emitting, e.g.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<i><span style="font-family: "courier new" , "courier" , monospace;">collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))</span></i></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">will trigger an exception whenever the metadata object is modified somewhere else. You might need to call <i>unlock()</i> in the subsequent bolts.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">This does not change the way the Metadata works but is just there to help you debug.</span></div>
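<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The general idea of such a soft lock can be sketched with a small stand-alone class. The <i>LockableMetadata</i> class below is a toy stand-in written for this post, not StormCrawler's Metadata, and the exception type and message it throws are assumptions. Only the <i>lock()</i> and <i>unlock()</i> method names mirror the ones mentioned above.</span></div>

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for illustration only; this is NOT StormCrawler's Metadata class. */
class LockableMetadata {
    private final Map<String, String> values = new HashMap<>();
    private boolean locked = false;

    /** Marks the object as read-only and returns it, so the call can sit inside an emit. */
    LockableMetadata lock() {
        locked = true;
        return this;
    }

    /** Allows modifications again, e.g. in a subsequent bolt. */
    LockableMetadata unlock() {
        locked = false;
        return this;
    }

    void setValue(String key, String value) {
        if (locked) {
            // Fail at the point of modification, which is what makes the culprit easy to find.
            throw new ConcurrentModificationException("Metadata modified while locked: " + key);
        }
        values.put(key, value);
    }

    String getValue(String key) {
        return values.get(key);
    }
}

class SoftLockDemo {
    public static void main(String[] args) {
        LockableMetadata md = new LockableMetadata();
        md.setValue("depth", "1");
        md.lock();
        try {
            md.setValue("depth", "2"); // rejected: the object is locked
        } catch (ConcurrentModificationException e) {
            System.out.println("caught: " + e.getMessage());
        }
        md.unlock().setValue("depth", "2"); // accepted after unlocking
        System.out.println(md.getValue("depth"));
    }
}
```

<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The lock is "soft" in the sense that it is just a flag checked on write: nothing is synchronised, and the stack trace of the exception simply points to the code that modified the object after it was emitted.</span></div>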
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<br /></div>
</div>
</div>
<p>Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a></p>
<h1>What's new in StormCrawler 1.15?</h1>
<p>Published 2019-09-19T09:54:00.003+01:00 (updated 2019-09-19T09:55:29.825+01:00)</p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<div style="text-align: justify;">
<span style="color: black; font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline; white-space: pre;"><a href="http://stormcrawler.net/" target="_blank">StormCrawler</a> 1.15 was released yesterday and, as usual, contains loads of improvements and bugfixes.</span></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<div style="text-align: justify;">
<span style="color: black; font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline; white-space: pre;"><br /></span></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: black; vertical-align: baseline; white-space: pre;">You can find the full list of changes on </span><a href="https://github.com/DigitalPebble/storm-crawler/milestone/25?closed=1">https://github.com/DigitalPebble/storm-crawler/milestone/25?closed=1</a></span></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
</div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre-wrap;">We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.</span><br />
<h2>
<span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre-wrap;">Dependency upgrades</span></h2>
</div>
</div>
<ul style="text-align: left;">
<li>Storm 1.2.3 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/743">#743</a>)</li>
<li>JSOUP 1.12.1 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/741">#741</a>)</li>
<li>ES 7.3.0 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/742">#742</a>)</li>
<li>Tika 1.22 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/726">#726</a>)</li>
</ul>
<h2 style="text-align: left;">
Core</h2>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;"><span style="color: red; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> </span>CharsetIdentification crashes on binary content (<a href="https://github.com/DigitalPebble/storm-crawler/issues/747">#747</a>)
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">FetcherBolt skips tuples which have spent too much time in queues (<a href="https://github.com/DigitalPebble/storm-crawler/issues/746">#746</a>)
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">Fetcher bolts generate metrics for HTTP status (<a href="https://github.com/DigitalPebble/storm-crawler/issues/745">#745</a>)
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">improvements to URLFilterBolt (<a href="https://github.com/DigitalPebble/storm-crawler/issues/740">#740</a>)
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;"><span style="color: red; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> </span>FetcherBolt doesn't recover when entering maxNumberURLsInQueues (<a href="https://github.com/DigitalPebble/storm-crawler/issues/738">#738</a>)</span></li>
<li><span style="color: red; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">RemoteDriverProtocol does not set user agent correctly (<a href="https://github.com/DigitalPebble/storm-crawler/issues/735">#735</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">Force English Locale for SimpleDateFormat in cookie converter (<a href="https://github.com/DigitalPebble/storm-crawler/issues/732">#732</a>)</span></li>
</ul>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">LangID</span></h2>
</div>
<div>
<ul style="text-align: left;"><span style="vertical-align: baseline; white-space: pre;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline; white-space: pre;">LangId normalises and returns value found via extraction (<a href="https://github.com/DigitalPebble/storm-crawler/issues/733">#733</a>)</span></li>
</span></ul>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif; white-space: pre;">Elasticsearch</span></h2>
</div>
<div style="text-align: left;">
<div>
<ul style="text-align: left;"><span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline;">
<li>
Pluggable URLBuffer and Hybrid Elasticsearch spout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/752">#752</a>)
</li>
<li>
ES spouts control how long the search is allowed to take with timeout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/753">#753</a>)
</li>
<li>
Improve types used for numeric values for metrics mappings (<a href="https://github.com/DigitalPebble/storm-crawler/issues/744">#744</a>)
</li>
<li>
Use sniffer for ES connections (<a href="https://github.com/DigitalPebble/storm-crawler/issues/734">#734</a>)
</li>
<li>
ScrollSpout to quit logging when finished (<a href="https://github.com/DigitalPebble/storm-crawler/issues/727">#727</a>)
</li>
<li>
ES spouts use nextFetchDate RangeQuery as a filter (<a href="https://github.com/DigitalPebble/storm-crawler/issues/725">#725</a>)
</li>
<li>
MetricsConsumer takes an optional date format (<a href="https://github.com/DigitalPebble/storm-crawler/issues/724">#724</a>)
</li>
<li>
StatusMetricsBolt returns a max of 10K results per status (<a href="https://github.com/DigitalPebble/storm-crawler/issues/723">#723</a>)
</li>
</span></ul>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline;"><span style="white-space: pre;"><br /></span></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline;">Happy crawling and thanks to all contributors!</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline;"><span style="white-space: pre;"><br /></span></span></div>
</div>
<span style="font-family: "arial" , "helvetica" , sans-serif; vertical-align: baseline;">
</span></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-1620909440810038062019-05-24T15:54:00.000+01:002019-05-24T15:54:02.381+01:00Reindexing StormCrawler's URL Status Index<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
StormCrawler holds the frontier and the page fetch status in the "status" index. In the most common setup, this is a sharded Elasticsearch index in which every document contains the page URL, the fetch status, the time when the URL should next be fetched, and further information as metadata. The index uses the SHA-256 of the URL as the document id. How URLs are distributed among shards is directly related to crawler politeness: a crawler needs to limit the load it puts on any particular server. If all URLs of the same host or domain are kept in the same shard, the spouts (we run one spout per shard) can already ensure that only a limited number of URLs per host or domain is emitted into the topology during a given time interval. This controlled flow of URLs supports the fetcher bolt, which ultimately enforces politeness and guarantees a minimum delay between successive requests sent to the same server.</div>
<div style="text-align: justify;">
<br /></div>
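The relation between document id, routing key and shard can be illustrated with a minimal Python sketch. It is not StormCrawler's or Elasticsearch's actual code: the function names doc_id and shard_for are made up for illustration, the routing key is assumed to be simply the URL's hostname, and a SHA-256 based hash stands in for Elasticsearch's own routing hash.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 16  # assumption: any fixed number of primary shards


def doc_id(url: str) -> str:
    """SHA-256 of the URL, hex-encoded, used as the document id."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the hostname so that one host always maps to one shard.

    The hash used here is only a stand-in for Elasticsearch's routing hash.
    """
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


urls = [
    "https://www.example.com/news/1",
    "https://www.example.com/news/2",
    "https://blog.example.org/post",
]
for u in urls:
    # URLs of the same host print the same shard number but different ids
    print(shard_for(u), doc_id(u)[:12], u)
```

Because the shard depends only on the routing key, a spout reading a single shard sees all URLs of a given host, which is what makes per-host throttling at the spout level possible.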
<h2>
Motivation</h2>
The status index grows over time. Sooner or later, unless you decide to delete it and start the crawler from scratch, you will hit one of the possible reasons to reindex it:<br />
<ol>
<li>change the number of shards (to allow the index to grow)</li>
<li>fix domain names acting as sharding key (see the discussion in <a href="https://github.com/DigitalPebble/storm-crawler/issues/684" target="_blank">#684</a>)</li>
<li>strip metadata to save storage space</li>
<li>apply current URL filters and normalization rules to all URLs in the status index</li>
<li>remove the document type – previously the documents in the "status" index were of type "status". In Elasticsearch 7.0 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.0/removal-of-types.html" target="_blank">document types are deprecated</a> and consequently SC 1.14 also removed the document types in all indexes (status, metrics, content)</li>
<li>merge/append indexes</li>
<li>pull or push to another ES cluster</li>
<li>... (there are some more good reasons, for sure)</li>
</ol>
<div style="text-align: justify;">
In my case it was a combination of points 1, 2, 3 and 5 applied to a status index holding 250 million URLs of <a href="https://commoncrawl.org/2016/10/news-dataset-available/" target="_blank">Common Crawl's news crawler</a>. At the same time, the server was upgraded and I had to move the index to a new machine (point 7). Enough reasons to try the reindexing topology, which was introduced in StormCrawler 1.14 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/688" target="_blank">#688</a>). Of course, some of these points (but not all) could also be achieved with standard Elasticsearch tools.</div>
<h2>
Configure the Reindex Topology</h2>
<div style="text-align: justify;">
Multiple scenarios are possible: both Elasticsearch indexes are on the same machine or cluster (using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html" target="_blank">index aliases</a>), or you pull or push from/to a remote index. My decision was to run the reindex topology on the machine hosting the new index and pull the data from the old index. To easily get around the Elasticsearch authentication, I simply forwarded the remote ES port to the local machine by running <span style="font-family: "courier new" , "courier" , monospace;">ssh -R 9201:localhost:9200 &lt;ip_new&gt; -fN &amp;</span> on the old machine. The old index is now visible on port 9201 on the new machine.</div>
<br />
<div style="text-align: justify;">
In any case, you first need to upgrade Elasticsearch on the target and source machine/cluster to version 7.0, as required by StormCrawler 1.14. In other words, both Elasticsearch indexes must be compatible with the Elasticsearch client used by the StormCrawler version running the reindexing topology.</div>
<br />
The topology is easily configured as <a href="https://storm.apache.org/releases/1.2.2/flux.html" target="_blank">Storm Flux</a>:<br />
<br />
<pre style="font-size: x-small;">name: "reindexer"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: false # let the properties defined in the flux take precedence
  - resource: false
    file: "es-conf.yaml"
    override: false
config:
  # topology settings
  topology.max.spout.pending: 600
  topology.workers: 1
  # old index (remote, forwarded to port 9201)
  es.status.addresses: "http://localhost:9201"
  es.status.index.name: "status"
  es.status.routing.fieldname: "metadata.hostname"
  es.status.concurrentRequests: 1
  # new index (local)
  es.status2.addresses: "localhost"
  es.status2.index.name: "status"
  es.status2.routing: true
  es.status2.routing.fieldname: "metadata.hostname"
  es.status2.bulkActions: 500
  es.status2.flushInterval: "1s"
  es.status2.concurrentRequests: 1
  es.status2.settings:
    cluster.name: "elasticsearch"
spouts:
#  - id: "filter"
#    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
#    parallelism: 10
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.ScrollSpout"
    parallelism: 10 # must be equal to number of shards in the old index
bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 4
    constructorArgs:
      - "status2"
streams:
  - from: "spout"
#    to: "filter"
#    grouping:
#      type: FIELDS
#      args: ["url"]
#      streamId: "status"
#  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"
</pre>
<br />
The parts needed to plug in a URL filter bolt are commented out. Of course, you should review all settings carefully so that they fit your situation. One important point is the number of spouts, which must be equal to the number of shards in the old index.<br />
<br />
After a few failed attempts, I decided to go with conservative performance settings:
<br />
<ul>
<li>a moderate number of tuples pending in the topology: <code>topology.max.spout.pending: 600</code>. With 10 spouts, at most 6,000 tuples can be pending.</li>
<li>four status bolts without concurrent requests (<code>es.status2.concurrentRequests: 1</code>).</li>
</ul>
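The upper bound on pending tuples follows directly from the two settings, since topology.max.spout.pending applies to each spout task. A tiny sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Values taken from the Flux file above.
max_spout_pending = 600  # topology.max.spout.pending, applied per spout task
spout_parallelism = 10   # one spout per shard of the old index

max_pending_tuples = max_spout_pending * spout_parallelism
print(max_pending_tuples)  # 6000
```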
<br />
To speed up the reindexing you might also change the refresh interval of the new Elasticsearch index by calling:
<br />
<pre> curl -H Content-Type:application/json -XPUT 'http://localhost:9200/status/_settings' \
--data '{"index" : {"refresh_interval" : -1 }}'
</pre>
<br />
Don't forget to set it back to the default once the reindexing is done:
<br />
<pre> curl ... --data '{"index" : {"refresh_interval" : null }}'
</pre>
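If the reindexing is scripted, the two request bodies can also be generated programmatically. The following is a small sketch using only Python's standard library; the helper name refresh_interval_body is made up for illustration, and actually sending the request is left out.

```python
import json


def refresh_interval_body(value):
    """Build the JSON body for the index settings call shown above.

    value=-1 disables periodic refreshes; value=None is serialised as null,
    which resets the setting to its default.
    """
    return json.dumps({"index": {"refresh_interval": value}})


disable_body = refresh_interval_body(-1)
restore_body = refresh_interval_body(None)
print(disable_body)
print(restore_body)
```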
<br />
<h2>
Running the Topology</h2>
The reindex topology is started as Flux via:
<br />
<pre> java -cp crawler-1.14.jar org.apache.storm.flux.Flux --remote reindex-flux.yaml
</pre>
<br />
Depending on the size of your index, this may take a while. I achieved about 6,000 reindexed documents per second, which means that the entire 250 million documents were reindexed in about 12 hours.<br />
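To estimate the run time for an index of a different size, the same back-of-the-envelope calculation can be repeated. A quick sketch, assuming the throughput stays roughly constant over the whole run:

```python
total_docs = 250_000_000  # approximate size of the status index
docs_per_second = 6_000   # observed reindexing throughput

seconds = total_docs / docs_per_second
hours = seconds / 3600
print(f"{hours:.1f} hours")
```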
<br />
<h2>
Verifying the New Status Index</h2>
Now that the topology has finished, let's check the new index and verify that everything has been reindexed. Let's first get the metrics of the old status index (on port 9201):<br />
<pre>%> curl -H Content-Type:application/json -XGET 'http://localhost:9201/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status xFN4EGHYRaGw8hi4qxMWrg  10   1  250305363      6038664      101gb          101gb
</pre>
and compare it with those of the new status index:
<br />
<pre>%> curl -H Content-Type:application/json -XGET 'http://localhost:9200/_cat/indices/status?v'
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   status ZFNT8FovT4eTl3140iOUvw  16   1  250278257         5938     87.2gb         87.2gb
</pre>
Great! The new index now has 16 shards, and about 14 GB of storage have been saved by stripping unneeded metadata. All 250 million documents have been reindexed. But wait: not all documents made it – a few thousand are missing! Panic! What's going on? I checked the logs for errors – nothing. Also: why are there deleted documents in the new index?<br />
<br />
Ok, let's first calculate the loss: 250305363 – 250278257 = 27106 (0.01%). That is probably not worth redoing the procedure for: either the links are outdated or the crawler will find them again. Still, I wanted to figure out the reason. But how to find the missing URLs? It's not trivial to compare two lists of 250 million items.<br />
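The loss calculation can be double-checked with a few lines of Python, using the docs.count values reported by the two _cat/indices calls:

```python
old_count = 250_305_363  # docs.count of the old status index
new_count = 250_278_257  # docs.count of the new status index

missing = old_count - new_count
loss_percent = 100 * missing / old_count
print(missing, f"{loss_percent:.3f}%")
```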
<br />
The solution is to first get the number of items per domain by aggregating on the field "metadata.hostname", which is used for routing documents to shards. The idea is to find the domains where the counts differ and then compare only the per-domain lists. Let's do it:
<br />
<pre>curl -s -H Content-Type:application/json -XGET http://localhost:9200/status/_search --data '{
  "aggs" : {
    "agg" : {
      "terms" : {
        "field" : "metadata.hostname",
        "size" : 100000
      }
    }
  }
}' | jq --raw-output '.aggregations.agg.buckets[] | [(.doc_count|tostring),.key] | join("\t")' \
  | rev | sort | rev >domain_counts_new_index.txt
</pre>
A short explanation of what this command does:
<br />
<ol>
<li>we send an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html">aggregation query</a> to the Elasticsearch index which counts documents per domain name (kept in "metadata.hostname")</li>
<li>the JSON output is processed by <a href="https://stedolan.github.io/jq/" target="_blank">jq</a> and the output is written into a text file with two columns – count and domain name</li>
<li>the list is sorted using reversed strings – this keeps the counts together per top-level domain, domain, subdomain which makes the comparison easier</li>
</ol>
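The effect of sorting by reversed strings is easiest to see on a small sample. The following Python sketch mimics the rev, sort, rev steps on four made-up input lines in the same count/tab/domain format; it uses plain code point order, so the result may differ slightly from a locale-aware sort.

```python
lines = [
    "76\twww.athletics.africa",
    "2\tcorsenetinfos.corsica",
    "4\tathletics.africa",
    "1\twww.corsenetinfos.corsica",
]

# Same idea as `rev | sort | rev`: order the lines by their reversed text,
# so that entries sharing a suffix (TLD, then domain) end up next to each other.
by_reversed = sorted(lines, key=lambda line: line[::-1])
for line in by_reversed:
    print(line)
```

In the sorted output each domain is directly followed by its subdomains, which is what makes a side-by-side comparison of the two lists readable.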
The procedure to get the per-domain counts from the old index is the same. After we have both lists we can compare them side by side:
<br />
<pre>%> diff -y domain_counts_old_index.txt domain_counts_new_index.txt
...
4   athletics.africa             |  80  athletics.africa
76  www.athletics.africa         &lt;
108 chri.ca                         108 chri.ca
2   www.alta-frequenza.corsica   |  2   alta-frequenza.corsica
2   corsenetinfos.corsica        |  3   corsenetinfos.corsica
1   www.corsenetinfos.corsica    &lt;
...
</pre>
The very first difference already revealed the reason for the discrepancy between the old and the new index! Remember the issue with domain names changing over time with different versions of the public suffix list? It was one of the reasons why the reindexing topology was introduced (see <a href="https://github.com/DigitalPebble/storm-crawler/issues/684" target="_blank">#684</a> and <a href="https://github.com/commoncrawl/news-crawl/issues/28" target="_blank">news-crawler#28</a>). If the routing key is not stable, the same URL may be indexed twice with different routing keys in two shards. This is exactly what happened for some of the domains affected by this issue. Here is one example:
<br />
<pre>6   hr.de                        |  19  hr.de
2   reportage.hr.de              &lt;
1   daserste.hr.de               &lt;
11  www.hr.de                    &lt;
</pre>
The new index contains only 19 items, although 6 + 2 + 1 + 11 = 20. I checked the remaining differences and the assumption proved true: the only domain names affected are those with recently introduced TLDs (<a href="https://en.wikipedia.org/wiki/.africa" target="_blank">.africa</a> was registered by ICANN in 2017) or with misclassified suffixes (<code>co.uk</code> is a valid suffix but <code>hr.de</code> is not).<br />
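The per-domain check for the hr.de example can be written down as a short Python sketch. The counts are the ones from the diff output; the helper total_for_domain is made up for illustration and simply sums a domain together with all of its subdomains.

```python
old_counts = {"hr.de": 6, "reportage.hr.de": 2, "daserste.hr.de": 1, "www.hr.de": 11}
new_counts = {"hr.de": 19}


def total_for_domain(counts, domain):
    """Sum the counts of the domain itself and all of its subdomains."""
    return sum(
        n
        for host, n in counts.items()
        if host == domain or host.endswith("." + domain)
    )


old_total = total_for_domain(old_counts, "hr.de")
new_total = total_for_domain(new_counts, "hr.de")
print(old_total, new_total, old_total - new_total)
```

A difference of one is consistent with a single URL that was stored twice in the old index under two different routing keys and ends up only once in the new index.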
<br />
Everything is fine now and the only open point is to bring the new index into production and restart the crawl topology on the new machine!<br />
<br /></div>
Sebastian Nagelhttp://www.blogger.com/profile/10154761575638007188noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-75753864818402684572019-05-13T17:07:00.001+01:002019-07-18T11:40:43.182+01:00What's new in StormCrawler 1.14<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><a href="http://stormcrawler.net/" target="_blank">StormCrawler</a> 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">You can find the full list of changes on </span><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;"><a href="https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1" style="text-decoration: none;">https://github.com/DigitalPebble/storm-crawler/milestone/24?closed=1</a></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 14.6667px; white-space: pre-wrap;">This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Dependency upgrades</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">crawler-commons 1.0</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/693" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#693</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">okhttp 3.14.0 (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/692" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#692</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">guava 27.1</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">icu4j 64.1</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">httpclient 4.5.8</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Snakeyaml 1.24</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">wiremock 2.22.0</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="background-color: white; color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">rometools 1.12.0</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/702" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#702</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Elasticsearch 7.0.0 (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/708" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#708</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
</ul>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Core</span></h3>
<b style="font-weight: normal;"><br /></b>
<br />
<ul style="text-align: left;">
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Track how long a spout has been without any URLs in its buffer (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/685" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#685</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Change ack mechanism for StatusUpdaterBolts</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/689" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#689</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Robots URL filter to get instructions from cache only (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/700" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#700</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Allow indexing under canonical URL if in the same domain, not just host (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/703" style="font-family: Arial; font-size: 10.5pt; font-weight: 700; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">#703</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: red; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> URLs ending with a space are fetched over and over again (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/704" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#704</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">ParseFilter to normalise the mime-type of documents into simple values (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/707" style="font-family: Arial; font-size: 10.5pt; font-weight: 700; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">#707</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Robot rules should check the cache in case of a redirection (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/709" style="font-family: Arial; font-size: 10.5pt; font-weight: 700; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">#709</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: red; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> Fix the logic around sitemap = false </span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">(</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/710" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#710</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Reduce logging of exceptions in FetcherBolt (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/719" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#719</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
</ul>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Elasticsearch</span></h3>
<b style="font-weight: normal;"><br /></b>
<br />
<ul style="text-align: left;">
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Asynchronous spouts (i.e. ES) can send queries after max delay since previous one ended</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/683" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#683</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">StatusUpdaterBolt to load config from non-default param names</span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/687" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#687</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Add a ScrollSpout to read all the documents from a shard (</span><a href="https://github.com/DigitalPebble/storm-crawler/pull/688" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#688</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> and </span><a href="https://github.com/DigitalPebble/storm-crawler/pull/690" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#690</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">) - see <a href="http://digitalpebble.blogspot.com/2019/05/reindexing-stormcrawlers-url-status.html" target="_blank">in our guest post how this can be used to reindex a status index</a>.</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">ES IndexerBolt : check success of batches before acking tuples (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/647" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#647</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: red; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">URLs with content that breaks ES get refetched over and over again</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/705" style="font-family: Arial; font-size: 10.5pt; font-weight: 700; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">#705</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: red; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">URLs without valid host name (and routing) stay DISCOVERED forever</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/706" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#706</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: red; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">/bugfix/</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/715" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#715</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">MetricsConsumer to include topology ID in metrics</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/714" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#714</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
</ul>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">WARC</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Generate WARC request records</span><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/509" style="font-family: Arial; font-size: 11pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#509</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">WARC format improvements (</span><a href="https://github.com/DigitalPebble/storm-crawler/pull/691" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#691</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
</ul>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Tika</span></h3>
<br />
<ul style="text-align: left;">
<li><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">Set mimetype whitelist for Tika Parser (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/712" style="font-family: Arial; font-size: 10.5pt; text-decoration-line: none; white-space: pre;"><span style="color: #1155cc; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">#712</span></a><span style="color: #24292e; font-family: "arial"; font-size: 10.5pt; vertical-align: baseline; white-space: pre-wrap;">)</span></li>
</ul>
<div>
<br /></div>
<div style="text-align: center;">
*********</div>
<div style="text-align: center;">
<br /></div>
<div>
<span style="color: #24292e; font-family: "arial";"><span style="font-size: 14px; white-space: pre-wrap;">I will be running a workshop on StormCrawler next month at the <a href="http://netpreserve.org/ga2019/programme/abstracts/#workshop-stormcrawler" target="_blank">Web Archiving Conference</a> in Zagreb and giving a <a href="http://netpreserve.org/ga2019/programme/abstracts/#26" target="_blank">presentation</a> jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at <a href="http://elastic.com/" target="_blank">Elastic</a>.</span></span></div>
<div>
<br /></div>
<div>
<span style="color: #24292e; font-family: "arial";"></span><br />
<div>
<span style="color: #24292e; font-family: "arial";"><span style="font-size: 14px; white-space: pre-wrap;">As usual, thanks to all contributors and users.</span></span></div>
<span style="color: #24292e; font-family: "arial";">
<div>
<span style="font-size: 14px; white-space: pre-wrap;"><br /></span></div>
<div>
<span style="font-size: 14px; white-space: pre-wrap;">Happy crawling!</span></div>
<div style="font-size: 14px; white-space: pre-wrap;">
<br /></div>
</span></div>
<div>
<span style="color: #24292e; font-family: "arial";"><span style="font-size: 14px; white-space: pre-wrap;"><br /></span></span></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-30773752811226056942019-02-11T11:14:00.003+00:002019-02-16T21:15:15.547+00:00Meet StormCrawler users: Q&A with Pixray (Germany)<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: "arial" , "helvetica" , sans-serif; text-align: justify;"><span style="font-size: large;">We are opening a series of Q&A blogs with Maik Piel telling us about the use of <a href="http://stormcrawler.net/" target="_blank">StormCrawler</a> at <a href="https://pixray.com/" target="_blank">Pixray</a>. </span></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i><br /></i></b></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i>Q: What do you guys do at Pixray? Why do you need web crawling?</i></b></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">We are experts in image tracking on the web. We help image rights holders protect their pictures on the web, and brands and manufacturers monitor their sales channels. Our customers range from news and picture agencies and individual photographers to e-commerce companies and luxury brands. Web crawling is one of the core building blocks of our platform, next to a massive picture matching platform, various APIs and our customer portals.</span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i>Q: What sort of crawls do you do? How big are they?</i></b></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">We do three kinds of scans: broad scans across complete regions of the web (like the EU or North America), deep scans on single domains and also near-realtime discovery scans on thousands of selected domains. For all of these different scans, we employ customized versions of StormCrawler to match the very distinct requirements in crawling patterns. Obviously, the biggest crawls are the broad regional scans, including more than 10 billion URLs and tens of millions of different domains.</span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i>Q: What software stack do you use? e.g. SC + ES + Grafana? Hardware used?</i></b></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">We use adapted and extended versions of StormCrawler as well as Elasticsearch and Kibana. We couple our crawling infrastructure with the rest of our platform through RabbitMQ. Our crawler runs on Ubuntu servers, each with an Intel Core i7, 32 GB of RAM and 4 TB of disk space. Each server runs both Apache Storm and Elasticsearch. In the future, we will split the storage (Elasticsearch) and the computation (Storm) layers onto separate hardware. We are also looking at options to employ container and service orchestration frameworks to scale our crawler infrastructure dynamically. </span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><i><b>Q: Why did you choose StormCrawler?</b></i></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">We initially built our crawler on <a href="http://nutch.apache.org/" target="_blank">Apache Nutch</a>. Needless to say, Nutch is a great and robust platform, but once you grow beyond a certain point you start to see limitations. The biggest ones are the low responsiveness to changes and the uneven system utilization caused by the long generate/crawl/update cycles. It sometimes took us 24 hours or more until we could see the effects of a change we had made to the software. Furthermore, we found it a bit troublesome to get valid statistics from Nutch in real time. StormCrawler solves all that for us. Every config or code change that we commit shows its effect immediately and statistics are very easy to get. There is no long-cycle batching anymore in StormCrawler, which gives us very even and continuous crawling and reduces our need for massive queuing of results to ensure an even utilization of downstream infrastructure. Kibana gives us great real-time insights into the crawl database. With Nutch, we had to run analysis jobs of around 4 hours, even if we just needed the status of a single URL. </span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i>Q: What do you like the most / least in StormCrawler?</i></b></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Besides the points mentioned above, we have to praise StormCrawler's extensibility. In our different setups we have both made changes to existing code in the StormCrawler project and written large amounts of our own code. The structure Apache Storm imposes is great. Components are very cleanly decoupled and it is easy to introduce custom functionality by just writing new Spouts and Bolts and linking them into the topology. For our use case we, of course, had to deal with pictures, which StormCrawler itself does not do. We just created our own Bolts for that. For our near-realtime discovery crawler, we needed an engine that calculates the revisit date for a URL based on various factors instead of a static value. Again, we could just create a specific spout for that. </span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b><i>Q: Anything in particular you'd like to have in a future release?</i></b></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <br />
<div style="text-align: left;">
<span style="font-size: large;"><span style="font-family: "arial" , "helvetica" , sans-serif;">It would be great to have a built-in way to prioritize different TLDs within the StormCrawler spouts. </span><span style="background-color: white; font-family: arial, helvetica, sans-serif;">We have built a custom solution for that which we might contribute back to StormCrawler at some point.</span></span></div>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><br /></span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-46239867494845730832019-01-06T08:20:00.003+00:002019-01-06T08:20:45.451+00:00What's new in StormCrawler 1.13<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<div style="text-align: left;">
<span style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Happy new year!</span></span></div>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<div style="text-align: left;">
<span style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">I have just released StormCrawler 1.13, which contains important bug fixes and some nice improvements.</span></span></div>
<div style="text-align: left;">
<span style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: justify; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<div style="margin: 0px;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">As usual, we advise users to upgrade to this version.</span></div>
<div style="margin: 0px;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<div style="margin: 0px;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
</div>
<div style="text-align: justify;">
</div>
<h3>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Dependency upgrades</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Tika 1.20 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/676" target="_blank">#676</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Xerces 2.12.0 (<a href="https://github.com/DigitalPebble/storm-crawler/pull/672" target="_blank">#672</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Guava 27.0.1 (<a href="https://github.com/DigitalPebble/storm-crawler/pull/672" target="_blank">#672</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Elasticsearch 6.5.3 (<a href="https://github.com/DigitalPebble/storm-crawler/pull/672" target="_blank">#672</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Jackson 2.8.11.3 (<a href="https://github.com/DigitalPebble/storm-crawler/commit/14e44195b0a8771fc23931ca421748f7c69e7932" target="_blank">14e44</a>)</span></li>
</ul>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<h3>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Core</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">FileSpout uses StringTabScheme by default (<a href="https://github.com/DigitalPebble/storm-crawler/issues/664" target="_blank">#664</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">JSoupParserBolt outlink limit per page (<a href="https://github.com/DigitalPebble/storm-crawler/pull/670" target="_blank">#670</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><span style="color: lime;">/BUGFIX/ </span>Date format used for HTTP if-modified-since requests must follow RFC7231 (<a href="https://github.com/DigitalPebble/storm-crawler/pull/674" target="_blank">#674</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><span style="color: lime;">/BUGFIX/ </span>DeletionBolt expects Metadata from tuples (<a href="https://github.com/DigitalPebble/storm-crawler/issues/675" target="_blank">#675</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Added configurable TextExtractor to JSoupParserBolt (<a href="https://github.com/DigitalPebble/storm-crawler/pull/678" target="_blank">#678</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><span style="color: red;">!BREAKING!</span> Core Spouts should use status stream if withDiscoveredStatus is set to true (<a href="https://github.com/DigitalPebble/storm-crawler/issues/677" target="_blank">#677</a>)</span></li>
</ul>
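<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">To illustrate the date format fix (#674): RFC 7231 prefers the fixed-length IMF-fixdate form for HTTP dates, such as the value of an If-Modified-Since header. The sketch below is not StormCrawler's code; the <i>HttpDates</i> class and its <i>format</i> method are hypothetical names, and it only shows one way of producing such a date with java.time.</span></div>

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration, not part of StormCrawler.
final class HttpDates {

    // IMF-fixdate: two-digit day, English day and month names, always expressed in GMT.
    private static final DateTimeFormatter IMF_FIXDATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Formats a point in time as an HTTP date string.
    static String format(Instant instant) {
        return IMF_FIXDATE.format(instant);
    }

    public static void main(String[] args) {
        // Prints the header with the example date used in RFC 7231.
        System.out.println("If-Modified-Since: " + format(Instant.parse("1994-11-06T08:49:37Z")));
    }
}
```

<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Pinning the locale to English and the zone to UTC keeps the output independent of the JVM's default settings.</span></div>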
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<h3>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">SQL</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">SQL IndexerBolt (<a href="https://github.com/DigitalPebble/storm-crawler/issues/608" target="_blank">#608</a>)</span></li>
</ul>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<h3>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Archetype</span></h3>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Archetype sets StormCrawler version in a property (<a href="https://github.com/DigitalPebble/storm-crawler/issues/668" target="_blank">#668</a>)</span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Replace ContentFilter with TextExtractor (<a href="https://github.com/DigitalPebble/storm-crawler/pull/678" target="_blank">#678</a>)</span></li>
</ul>
<span style="font-size: large;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">Apart from the changes to the core spouts (<a href="https://github.com/DigitalPebble/storm-crawler/issues/664" target="_blank">#664</a> and <a href="https://github.com/DigitalPebble/storm-crawler/issues/677" target="_blank">#677</a>), the main new feature is the addition of the TextExtractor</span><span style="font-family: Arial, Helvetica, sans-serif;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/pull/678" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#678</a><span style="font-family: Arial, Helvetica, sans-serif;">) for the JSoupParserBolt. Unlike the ContentFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter, as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements to avoid boilerplate and navigation elements, and it provides a far cleaner text content than the </span><span style="font-family: Arial, Helvetica, sans-serif;">ContentFilter, which merges some tokens. The TextExtractor can also be used to define exclusion zones, which are applied either to the restricted zones or to the whole document if no such zone was defined or found. This is useful, for instance, to remove SCRIPT or STYLE elements.</span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></span><br />
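<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">The zone logic described above can be sketched roughly as follows. This is not the TextExtractor's implementation: it does not use Jsoup, the <i>ZoneTextSketch</i> and <i>Node</i> classes are hypothetical, and zones are matched on tag names only, which is a simplification made for the example.</span></div>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not StormCrawler code: zones are matched on tag names only.
final class ZoneTextSketch {

    // Minimal stand-in for a parsed element: a tag name, its own text and its children.
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Collects the outermost elements whose tag is listed as an inclusion zone.
    static void findZones(Node node, Set<String> include, List<Node> zones) {
        if (include.contains(node.tag)) {
            zones.add(node);
            return;
        }
        for (Node child : node.children) {
            findZones(child, include, zones);
        }
    }

    // Appends the text found under a node, skipping subtrees rooted at an excluded tag.
    static void collectText(Node node, Set<String> exclude, StringBuilder out) {
        if (exclude.contains(node.tag)) {
            return;
        }
        if (!node.text.isEmpty()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(node.text);
        }
        for (Node child : node.children) {
            collectText(child, exclude, out);
        }
    }

    // Restricts the text to the inclusion zones, or to the whole document if none is found.
    static String extract(Node root, Set<String> include, Set<String> exclude) {
        List<Node> zones = new ArrayList<>();
        findZones(root, include, zones);
        if (zones.isEmpty()) {
            zones.add(root);
        }
        StringBuilder out = new StringBuilder();
        for (Node zone : zones) {
            collectText(zone, exclude, out);
        }
        return out.toString();
    }

    // Small sample document: navigation, an article containing a script, and a footer.
    static Node sample() {
        Node article = new Node("ARTICLE", "Main story.")
                .add(new Node("SCRIPT", "var x = 1;"))
                .add(new Node("P", "Second paragraph."));
        return new Node("BODY", "")
                .add(new Node("NAV", "Home About"))
                .add(article)
                .add(new Node("FOOTER", "Copyright"));
    }

    public static void main(String[] args) {
        Set<String> exclude = Set.of("SCRIPT", "STYLE");
        // An ARTICLE zone is found, so navigation and footer text are left out.
        System.out.println(extract(sample(), Set.of("ARTICLE"), exclude));
        // No MAIN zone exists, so the exclusions apply to the whole document.
        System.out.println(extract(sample(), Set.of("MAIN"), exclude));
    }
}
```

<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">In both calls the SCRIPT content is dropped; only the first call also drops the navigation and footer text, because only there an inclusion zone is found.</span></div>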
<div>
<div style="text-align: justify;">
<div>
<div style="font-family: arial, helvetica, sans-serif; text-align: left;">
<span style="font-size: large;">As usual, thanks to all contributors and users, <span style="font-family: "arial" , "helvetica" , sans-serif; text-align: justify;"> and particularly the <a href="http://www.gov.nt.ca/" target="_blank">Government</a></span><span style="font-family: "arial" , "helvetica" , sans-serif;"><a href="http://www.gov.nt.ca/" target="_blank"> of Northwest Territories</a></span><span style="font-family: "arial" , "helvetica" , sans-serif; text-align: justify;"> in Canada who kindly donated some of the code of the <span style="font-family: Arial, Helvetica, sans-serif; text-align: left;">TextExtractor</span>.</span></span></div>
<div style="font-family: arial, helvetica, sans-serif; text-align: left;">
<span style="font-size: large;"><br /></span></div>
<div style="font-family: arial, helvetica, sans-serif; text-align: left;">
<span style="font-size: large;">Happy crawling!</span></div>
</div>
</div>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<ul></ul>
</div>
</div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-91460709168040186522018-11-22T11:11:00.003+00:002018-11-30T10:48:10.031+00:00What's new in StormCrawler 1.12<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/653" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#653</a><span style="font-family: "arial" , "helvetica" , sans-serif;">) and various other improvements.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As usual, we advise users to upgrade to this version.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Dependency upgrades</span></h3>
<br />
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">JSOUP 1.11.3 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/663" target="_blank">#663</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Elasticsearch 6.5.0 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/661" target="_blank">#661</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Jackson and Wiremock dependencies (<a href="https://github.com/DigitalPebble/storm-crawler/issues/640" target="_blank">#640</a>)</span></li>
</ul>
<div>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Core</span></h3>
</div>
<div>
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Post JSON data with OKHTTP protocol via metadata (<a href="https://github.com/DigitalPebble/storm-crawler/issues/641" target="_blank">#641</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Selenium RemoteDriverProtocol triggered by K/V in metadata (<a href="https://github.com/DigitalPebble/storm-crawler/issues/642" target="_blank">#642</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">SeleniumProtocol NavigationFilters not reached in case of a redirection (<a href="https://github.com/DigitalPebble/storm-crawler/issues/643" target="_blank">#643</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Limit crawl to URLs found in sitemaps (<a href="https://github.com/DigitalPebble/storm-crawler/issues/645" target="_blank">#645</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><i>spout.reset.fetchdate.after</i> based on time when query was set to NOW (<a href="https://github.com/DigitalPebble/storm-crawler/issues/648" target="_blank">#648</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Avoid StackOverflowError when generating DocumentFragment from JSOUP (<a href="https://github.com/DigitalPebble/storm-crawler/issues/653" target="_blank">#653</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">redirected sitemaps don't have <i>isSitemap=true</i> (<a href="https://github.com/DigitalPebble/storm-crawler/issues/660" target="_blank">#660</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Staggered scheduling of sitemap URLs (<a href="https://github.com/DigitalPebble/storm-crawler/issues/657" target="_blank">#657</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Scheduling -> round to the closest second, minute or hour (<a href="https://github.com/DigitalPebble/storm-crawler/issues/654" target="_blank">#654</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">FetcherBolt don't add discovered sitemaps if the robots rules do not allow them (<a href="https://github.com/DigitalPebble/storm-crawler/issues/662" target="_blank">#662</a>)</span></li>
</ul>
</div>
<div>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">WARC</span></h3>
</div>
<div>
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">WARC record format: trailing zero byte causes WARC parser to fail (<a href="https://github.com/DigitalPebble/storm-crawler/issues/652" target="_blank">#652</a>)</span></li>
</ul>
<div>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Elasticsearch</span></h3>
</div>
</div>
<div>
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES IndexerBolt track number of batch sent (<a href="https://github.com/DigitalPebble/storm-crawler/issues/540" target="_blank">#540</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Rename index <i>index</i> into docs (<a href="https://github.com/DigitalPebble/storm-crawler/issues/649" target="_blank">#649</a>)</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES StatusMetricsBolt generate metrics for total number of docs (<a href="https://github.com/DigitalPebble/storm-crawler/issues/651" target="_blank">#651</a>)</span></li>
</ul>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"></span><br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"> Coming next...</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"> </span><br />
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"> <div style="text-align: justify;">
The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!</div>
<div>
<br /></div>
<div>
Finally, there will be a StormCrawler workshop in <a href="https://www.bigdataconference.lt/workshops/" target="_blank">Vilnius</a> next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.</div>
<div>
<br /></div>
<div>
As usual, thanks to all contributors and users. Happy crawling!<br />
<br />
<h2 style="text-align: left;">
UPDATE</h2>
<div>
There were 2 bugs in release 1.12 which have been fixed in 1.12.1, see details on </div>
<div>
<br /></div>
<div>
<a href="https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1">https://github.com/DigitalPebble/storm-crawler/milestone/23?closed=1</a></div>
<br />
<br />
<br />
<br />
<br /></div>
<div>
<br /></div>
</span></div>
</div>
</div>
<h2 style="text-align: left;">
What's new in StormCrawler 1.11</h2>
<div>
Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a> on 2018-10-18</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;">I've just released StormCrawler 1.11. Here are the main changes, some of which require modifications to your configuration.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Users should upgrade to this version as it fixes several bugs and adds many new features.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Dependency upgrades</span></b><br />
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Tika 1.19.1 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/606" target="_blank">#606</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Elasticsearch 6.4.1 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/607" target="_blank">#607</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLR 7.5 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/624" target="_blank">#624</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">OKHttp 3.11.0</span></li>
</ul>
</div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Core</span></b><br />
<br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;"><b><span style="font-weight: 400;"><i>/<u>bugfix</u>/</i> </span></b>FetcherBolts original metadata overwrites metadata returned by protocol <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/636" style="font-weight: 400;" target="_blank">#636</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Override Globally Configured Accepts and Accepts-Language Headers Per-URL <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/634" style="font-weight: 400;" target="_blank">#634</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Support for cookies in okhttp implementation <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/632" style="font-weight: 400;" target="_blank">#632</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">AbstractHttpProtocol uses StringTabScheme to parse input into URL and Metadata <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/631" style="font-weight: 400;" target="_blank">#631</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Improve MimeType detection for interpreted server-side languages <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/630" style="font-weight: 400;" target="_blank">#630</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><i>/<b style="font-style: normal;"><span style="font-weight: 400;"><i><u>bugfix</u></i></span></b>/ </i>Custom intervals in Scheduler can't contain dots <b>(<a href="https://github.com/DigitalPebble/storm-crawler/issues/616" style="font-weight: 400;" target="_blank">#616</a><span style="font-weight: 400;">)</span></b></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">OKHTTP protocol trust all SSL certificates (<a href="https://github.com/DigitalPebble/storm-crawler/issues/615" target="_blank">#615</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">HTTPClient protocol setDefaultMaxPerRoute based on max threads per queue (<a href="https://github.com/DigitalPebble/storm-crawler/issues/594" target="_blank">#594</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Fetcher Added byteLength to Metadata (<a href="https://github.com/DigitalPebble/storm-crawler/pull/599" target="_blank">#599</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">URLFilters + ParseFilters refactoring (<a href="https://github.com/DigitalPebble/storm-crawler/pull/593" target="_blank">#593</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">HTTPClient Add simple basic auth system (<a href="https://github.com/DigitalPebble/storm-crawler/pull/589" target="_blank">#589</a>)</span></li>
</ul>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">WARC</span></b></div>
<br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;"><b><span style="font-weight: 400;"><i>/bugfix/</i> </span></b>WARCHdfsBolt writes zero byte files (<a href="https://github.com/DigitalPebble/storm-crawler/issues/596" target="_blank">#596</a>)</span></li>
</ul>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">SOLR</span></b></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLR StatusUpdater use short status name (<a href="https://github.com/DigitalPebble/storm-crawler/issues/627" target="_blank">#627</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLRSpout log queries, time and number of results (<a href="https://github.com/DigitalPebble/storm-crawler/issues/623" target="_blank">#623</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLR spout - reuse nextFetchDate (<a href="https://github.com/DigitalPebble/storm-crawler/issues/622" target="_blank">#622</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Move reset.fetchdate.after to AbstractQueryingSpout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/628" target="_blank">#628</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Abstract functionalities of spout implementations (<a href="https://github.com/DigitalPebble/storm-crawler/issues/617" target="_blank">#617</a>) - see below</span></li>
</ul>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">SQL</span></b></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">MetricsConsumer (<a href="https://github.com/DigitalPebble/storm-crawler/issues/612" target="_blank">#612</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Batch PreparedStatements in SQL status updater bolt, fixes (<a href="https://github.com/DigitalPebble/storm-crawler/issues/610" target="_blank">#610</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SQLSpout group by hostname and get top N results (<a href="https://github.com/DigitalPebble/storm-crawler/issues/609" target="_blank">#609</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Harmonise param names for SQL (<a href="https://github.com/DigitalPebble/storm-crawler/issues/619" target="_blank">#619</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Move reset.fetchdate.after to AbstractQueryingSpout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/628" target="_blank">#628</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Abstract functionalities of spout implementations (<a href="https://github.com/DigitalPebble/storm-crawler/issues/617" target="_blank">#617</a>) - see below</span></li>
</ul>
</div>
</div>
<br />
<div>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Elasticsearch</span></b></div>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;"><i>/<b style="font-style: normal;"><span style="font-weight: 400;"><i><u>bugfix</u></i></span></b>/</i> NPE in AggregationSpout when there is not any status index created (<a href="https://github.com/DigitalPebble/storm-crawler/issues/597" target="_blank">#597</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><i>/<b style="font-style: normal;"><span style="font-weight: 400;"><i><u>bugfix</u></i></span></b>/ </i>NPE in CollapsingSpout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/595" target="_blank">#595</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Added ability to implement custom index names based on metadata information (<a href="https://github.com/DigitalPebble/storm-crawler/issues/591" target="_blank">#591</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">StatusMetricsBolt - Added check to avoid NPE when interacting with multi search response (<a href="https://github.com/DigitalPebble/storm-crawler/issues/598" target="_blank">#598</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Change default value of <i>es.status.reset.fetchdate.after</i> (<a href="https://github.com/DigitalPebble/storm-crawler/issues/590" target="_blank">#590</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Log error if Elasticsearch reports an unexpected problem (<a href="https://github.com/DigitalPebble/storm-crawler/issues/575" target="_blank">#575</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">ES Wrapper for URLFilters implementing JSONResource (<a href="https://github.com/DigitalPebble/storm-crawler/issues/588" target="_blank">#588</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Move reset.fetchdate.after to AbstractQueryingSpout (<a href="https://github.com/DigitalPebble/storm-crawler/issues/628" target="_blank">#628</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Abstract functionalities of spout implementations (<a href="https://github.com/DigitalPebble/storm-crawler/issues/617" target="_blank">#617</a>) - see below</span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">As you've probably noticed, <a href="https://github.com/DigitalPebble/storm-crawler/issues/617" target="_blank">#617</a> affects ES, SOLR as well as SQL. The idea behind it is that the spouts in these modules have a lot in common, as they all query a backend for URLs to fetch. We moved some of the functionalities to a brand new class <a href="https://github.com/DigitalPebble/storm-crawler/blob/b858c013ece6c7aca94a98f5e0658ba3c8b9501a/core/src/main/java/com/digitalpebble/stormcrawler/persistence/AbstractQueryingSpout.java" target="_blank">AbstractQueryingSpout</a>, which greatly reduces the amount of code. The handling of the URL caching, the TTL for the purgatory and the minimum delay between queries is now done in that class. As a result, the spout implementations have less to do and can focus on the specifics of getting the data from their respective backends. A nice side effect is that the SQL and SOLR spouts now benefit from some of the functionalities which were up to now only available in ES.</span></div>
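<div>
<span style="font-family: Arial, Helvetica, sans-serif;">To illustrate the division of labour, here is a minimal, self-contained sketch of the pattern. It is not the code of AbstractQueryingSpout: the class name <i>QueryingSpoutSketch</i>, the methods <i>queryBackend</i> and <i>nextUrl</i>, and the way time is passed in are all hypothetical and only meant to show how the caching, purgatory TTL and minimum delay can live in a base class while subclasses only talk to their backend.</span></div>

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the shared spout logic; NOT the real AbstractQueryingSpout.
abstract class QueryingSpoutSketch {
    private final long minDelayMs;     // plays the role of spout.min.delay.queries
    private final long ttlPurgatoryMs; // plays the role of spout.ttl.purgatory
    private final Queue<String> buffer = new ArrayDeque<>();
    // URLs emitted recently, mapped to the time they were emitted
    private final Map<String, Long> purgatory = new HashMap<>();
    private long lastQueryMs = -1;     // -1 means "no query sent yet"

    QueryingSpoutSketch(long minDelayMs, long ttlPurgatoryMs) {
        this.minDelayMs = minDelayMs;
        this.ttlPurgatoryMs = ttlPurgatoryMs;
    }

    /** Backend-specific part (ES, SOLR, SQL...): return candidate URLs. */
    protected abstract List<String> queryBackend();

    /** Returns the next URL to fetch, or null if nothing is available at nowMs. */
    String nextUrl(long nowMs) {
        // Forget URLs whose time in the purgatory has expired.
        purgatory.values().removeIf(emittedAt -> nowMs - emittedAt >= ttlPurgatoryMs);
        boolean delayElapsed = lastQueryMs < 0 || nowMs - lastQueryMs >= minDelayMs;
        // Only hit the backend when the buffer is empty and the delay has elapsed.
        if (buffer.isEmpty() && delayElapsed) {
            lastQueryMs = nowMs;
            for (String url : queryBackend()) {
                if (!purgatory.containsKey(url) && !buffer.contains(url)) {
                    buffer.add(url);
                }
            }
        }
        String url = buffer.poll();
        if (url != null) {
            purgatory.put(url, nowMs);
        }
        return url;
    }

    public static void main(String[] args) {
        // A toy backend which always returns the same two URLs.
        QueryingSpoutSketch spout = new QueryingSpoutSketch(1000, 5000) {
            @Override
            protected List<String> queryBackend() {
                return List.of("http://example.com/a", "http://example.com/b");
            }
        };
        System.out.println(spout.nextUrl(0));
        System.out.println(spout.nextUrl(10));
        System.out.println(spout.nextUrl(20));
    }
}
```

<div>
<span style="font-family: Arial, Helvetica, sans-serif;">In this sketch a backend implementation only overrides <i>queryBackend</i>; everything else is inherited, which is the kind of code reduction described above.</span></div>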
</div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">You will need to update your configuration to replace the elements which were specific to ES with the generic ones, i.e. <span style="background-color: white; color: #22863a; font-size: 12px; white-space: pre;">spout.reset.fetchdate.after</span>, <span style="background-color: white; color: #22863a; font-size: 12px; white-space: pre;">spout.ttl.purgatory</span> and <span style="background-color: white; color: #22863a; font-size: 12px; white-space: pre;">spout.min.delay.queries</span>. These are also used by SOLR and SQL.</span></div>
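<div>
<span style="font-family: Arial, Helvetica, sans-serif;">If you keep your configuration in a map before submitting the topology, the renaming can be sketched as below. This is only an illustration: the helper class <i>SpoutConfigMigration</i> is hypothetical, and of the three legacy names only <i>es.status.reset.fetchdate.after</i> appears in the changelog above; the other two are assumed to follow the same <i>es.status.</i> prefix, so check them against your own configuration file. The values are arbitrary.</span></div>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
final class SpoutConfigMigration {
    // Legacy ES-specific key -> generic key introduced in 1.11.
    // ASSUMPTION: the second and third legacy names are guessed from the
    // "es.status." prefix; only the first one is mentioned in the post.
    static final Map<String, String> RENAMED = Map.of(
            "es.status.reset.fetchdate.after", "spout.reset.fetchdate.after",
            "es.status.ttl.purgatory", "spout.ttl.purgatory",
            "es.status.min.delay.queries", "spout.min.delay.queries");

    /** Returns a copy of conf where legacy keys are replaced by the generic ones. */
    static Map<String, Object> migrate(Map<String, Object> conf) {
        Map<String, Object> out = new LinkedHashMap<>();
        // First pass: copy everything that is not a legacy key.
        for (Map.Entry<String, Object> e : conf.entrySet()) {
            if (!RENAMED.containsKey(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        // Second pass: a legacy value is used only if the generic key is not set.
        for (Map.Entry<String, Object> e : conf.entrySet()) {
            String generic = RENAMED.get(e.getKey());
            if (generic != null) {
                out.putIfAbsent(generic, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new LinkedHashMap<>();
        conf.put("es.status.reset.fetchdate.after", 120); // arbitrary value
        conf.put("fetcher.threads.number", 50);           // untouched by the migration
        System.out.println(migrate(conf));
    }
}
```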
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">Please note that these changes also impact some of the metrics names.</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Coming next...</span></b></div>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></b></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">Storm 2.0.0 should be released soon, which is very exciting! There's a <a href="https://github.com/DigitalPebble/storm-crawler/tree/2.x" target="_blank">branch</a> of StormCrawler which anticipates some of the changes, even though it hasn't been tested much yet. Give it a try if you want to be on the cutting edge!</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">I expect the SOLR and SQL backends to get further improvements and progressively catch up with our Elasticsearch resources.</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">Finally, our <a href="https://www.eventbrite.com/e/introduction-to-web-crawling-with-stormcrawler-and-elasticsearch-tickets-50949891497" target="_blank">Bristol workshop</a> next month is now full but there is <a href="https://www.bigdataconference.lt/introduction-to-web-crawling/" target="_blank">one in Vilnius on 27/11</a>. I'll also give <a href="https://www.bigdataconference.lt/Julien-Nioche/" target="_blank">a talk there the following day</a>. If you are around, come and say hi and get yourself a <a href="https://twitter.com/digitalpebble/status/1047781400441761792" target="_blank">StormCrawler sticker</a>.</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">As usual, thanks to all contributors and users. Happy crawling!</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="background-color: white; color: #22863a; font-size: 12px; white-space: pre;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></span></div>
<div>
<span style="background-color: white; color: #22863a; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 12px; white-space: pre;"><br /></span></div>
</div>
<h2 style="text-align: left;">
What's new in StormCrawler 1.10</h2>
<div>
Posted by <a href="http://www.blogger.com/profile/16499716503708780310">Julien Nioche</a> on 2018-06-14</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">StormCrawler 1.9 is only a couple of weeks old but the new features added since then justify a new release.</span></div>
<span style="font-family: Arial, Helvetica, sans-serif;"><b><br /></b></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Dependency upgrades</b></span><br />
<br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Apache Storm 1.2.2 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/583" target="_blank">#583</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Crawler-Commons 0.10 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/580" target="_blank">#580</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Elasticsearch 6.3.0 (<a href="https://github.com/DigitalPebble/storm-crawler/issues/587" target="_blank">#587</a>)</span></li>
</ul>
<br />
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Archetype</b></span><br />
<br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">parsefilters: added CommaSeparatedToMultivaluedMetadata to split <i>parse.keywords</i></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">bugfix: java topology in archetype does not use FeedParserBolt, fixes <a href="https://github.com/DigitalPebble/storm-crawler/issues/551" target="_blank">#551</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">bugfix: archetype - move SC dependency to first place to avoid STORM-2428, fixes <a href="https://github.com/DigitalPebble/storm-crawler/issues/559" target="_blank">#559</a></span></li>
</ul>
<br />
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Elasticsearch</b></span><br />
<br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">IndexerBolt set pipeline via config (<a href="https://github.com/DigitalPebble/storm-crawler/issues/584" target="_blank">#584</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Wrapper for loading JSON-based ParseFilters from ES (<a href="https://github.com/DigitalPebble/storm-crawler/issues/569" target="_blank">#569</a>) - see below</span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Core</b></span></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold (<a href="https://github.com/DigitalPebble/storm-crawler/issues/582" target="_blank">#582</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">ParseFilter to tag a document based on pattern matching on its URL (<a href="https://github.com/DigitalPebble/storm-crawler/issues/577" target="_blank">#577</a>)</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">New URL filter implementation based on JSON file and organised per hostname or domain <a href="https://github.com/DigitalPebble/storm-crawler/issues/578" target="_blank">#578</a></span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">Let's have a closer look at some of the points above.</span></div>
</div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">The <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/filter/CollectionTagger.java" target="_blank">CollectionTagger</a> is a ParseFilter that provides functionality similar to Collections in <a href="https://support.google.com/gsa/answer/6329145?hl=en" target="_blank">Google Search Appliance</a>, namely the ability to add a key/value pair to the metadata when the URL of a document matches one or more regular expressions. The rules are expressed in a JSON file and look like this:</span></div>
<div>
<br /></div>
<div>
<div>
<i>{</i></div>
<div>
<i> "collections": [{</i></div>
<div>
<i> "name": "stormcrawler",</i></div>
<div>
<i> "includePatterns": ["http://stormcrawler.net/.+"]</i></div>
<div>
<i> },</i></div>
<div>
<i> {</i></div>
<div>
<i> "name": "crawler",</i></div>
<div>
<i> "includePatterns": [".+crawler.+", ".+nutch.+"],</i></div>
<div>
<i> "excludePatterns": [".+baby.+", ".+spider.+"]</i></div>
<div>
<i> }</i></div>
<div>
<i> ]</i></div>
<div>
<i>}</i></div>
</div>
<div>
<br /></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">Please note that the format is different from what GSA does but it can achieve the same thing. </span></div>
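<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">To give an idea of how such rules can be evaluated, here is a small standalone sketch using the two collections from the JSON above. It is not the CollectionTagger code: the class <i>CollectionTaggerSketch</i> and its methods are hypothetical, and it assumes that a pattern must match the whole URL and that an exclude pattern always wins over an include pattern.</span></div>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the real CollectionTagger.
final class CollectionTaggerSketch {
    static final class CollectionDef {
        final String name;
        final List<Pattern> includes = new ArrayList<>();
        final List<Pattern> excludes = new ArrayList<>();

        CollectionDef(String name, List<String> includePatterns, List<String> excludePatterns) {
            this.name = name;
            for (String p : includePatterns) includes.add(Pattern.compile(p));
            for (String p : excludePatterns) excludes.add(Pattern.compile(p));
        }

        // ASSUMPTION: whole-URL match, and excludes take precedence over includes.
        boolean matches(String url) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) return false;
            }
            for (Pattern p : includes) {
                if (p.matcher(url).matches()) return true;
            }
            return false;
        }
    }

    // Same rules as in the JSON example above.
    static final List<CollectionDef> COLLECTIONS = List.of(
            new CollectionDef("stormcrawler",
                    List.of("http://stormcrawler.net/.+"), List.of()),
            new CollectionDef("crawler",
                    List.of(".+crawler.+", ".+nutch.+"),
                    List.of(".+baby.+", ".+spider.+")));

    /** Returns the names of all the collections a URL belongs to. */
    static List<String> tag(String url) {
        List<String> names = new ArrayList<>();
        for (CollectionDef c : COLLECTIONS) {
            if (c.matches(url)) names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tag("http://stormcrawler.net/faq/"));
        System.out.println(tag("http://nutch.apache.org/"));
    }
}
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">With these rules a page on stormcrawler.net ends up in both collections, since its URL also contains "crawler".</span></div>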
<div>
<br /></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">So far, nothing revolutionary: the resource file gets loaded from the uber-jar, just like any other resource. However, what we introduced at the same time is the interface <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSONResource.java" target="_blank">JSONResource</a>, which CollectionTagger implements. This interface defines how implementations load a JSON file to build their resources.</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">Here comes the interesting bit. We added a new resource for Elasticsearch in <a href="https://github.com/DigitalPebble/storm-crawler/issues/569" target="_blank">#569</a> called </span><span style="text-align: left;"><span style="font-family: Arial, Helvetica, sans-serif;"><a href="https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/parse/filter/JSONResourceWrapper.java" target="_blank">JSONResourceWrapper</a>. As the name suggests, this wraps any ParseFilter implementing JSONResource and delegates the filtering to it. It also loads the JSON resource from an Elasticsearch document instead of the uber-jar, and reloads it periodically. This allows you to<u style="font-weight: bold;"> update a resource without having to recompile the uber-jar and restart the topology</u>. </span></span></div>
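<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">The reloading idea can be sketched in a few lines. The class below is hypothetical and is not how JSONResourceWrapper is written: it reloads lazily when the resource is accessed rather than on a schedule, and the loader and clock are passed in so that the example stays self-contained. In a real setup the loader would fetch and parse the JSON document from the backend.</span></div>

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of a periodically refreshed resource.
final class ReloadingResource<T> {
    private final Supplier<T> loader;  // e.g. fetches and parses the JSON document
    private final long refreshMs;
    private final LongSupplier clock;  // returns the current time in milliseconds
    private T current;
    private long loadedAt;

    ReloadingResource(Supplier<T> loader, long refreshSeconds, LongSupplier clock) {
        this.loader = loader;
        this.refreshMs = refreshSeconds * 1000L;
        this.clock = clock;
        this.current = loader.get();
        this.loadedAt = clock.getAsLong();
    }

    /** Returns the resource, reloading it first if the refresh period has elapsed. */
    T get() {
        long now = clock.getAsLong();
        if (now - loadedAt >= refreshMs) {
            try {
                current = loader.get();
            } catch (RuntimeException e) {
                // Keep serving the previous version if the reload fails.
            }
            loadedAt = now;
        }
        return current;
    }

    public static void main(String[] args) {
        ReloadingResource<String> rules =
                new ReloadingResource<>(() -> "{\"collections\": []}", 60,
                        System::currentTimeMillis);
        System.out.println(rules.get());
    }
}
```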
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="text-align: left;"><span style="font-family: Arial, Helvetica, sans-serif;">The wrapper is configured in the usual way, i.e. via the parsefilter.json file, like so:</span></span></div>
<div style="text-align: justify;">
<span style="text-align: left;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="text-align: left;"><div style="font-family: inherit; font-style: italic; text-align: justify;">
{</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"class": "com.digitalpebble.stormcrawler.elasticsearch.parse.filter.JSONResourceWrapper",</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"name": "ESCollectionTagger",</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"params": {</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"refresh": "60",</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"delegate": {</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"class": "com.digitalpebble.stormcrawler.parse.filter.CollectionTagger",</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"params": {</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
"file": "collections.json"</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
}</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
}</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
}</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
}</div>
<div style="font-family: inherit; font-style: italic; text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">The JSONResourceWrapper also needs to know where Elasticsearch lives. This is set via the usual configuration file:</span></div>
<div style="font-family: inherit; text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<div>
<span style="font-family: inherit;"><i> es.config.addresses: "localhost"</i></span></div>
<div>
<span style="font-family: inherit;"><i> es.config.index.name: "config"</i></span></div>
<div>
<span style="font-family: inherit;"><i> es.config.doc.type: "config"</i></span></div>
<div>
<span style="font-family: inherit;"><i> es.config.settings:</i></span></div>
<div>
<span style="font-family: inherit;"><i> cluster.name: "elasticsearch"</i></span></div>
<div>
<span style="font-family: inherit;"><i><br /></i></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">You can then push a modified version of the resources to Elasticsearch e.g. with CURL</span></div>
<div>
<span style="font-family: inherit;"><i><br /></i></span></div>
<div>
<span style="font-family: inherit;"><i>curl -XPUT 'localhost:9200/config/config/collections.json?pretty' -H 'Content-Type: application/json' -d @collections.json</i></span></div>
<div>
<span style="font-family: inherit;"><i><br /></i></span></div>
</div>
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">Another resource we introduced in this release is the <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/filtering/regex/FastURLFilter.java" target="_blank">FastURLFilter</a>, which also implements JSONResource (but as there isn't a wrapper for URLFilters <a href="https://github.com/DigitalPebble/storm-crawler/issues/588" target="_blank">yet</a>, it can't be loaded from ES). It is similar to the existing URL filter in that it removes URLs based on regular expressions. However, it organises the rules per domain or hostname, which makes it more efficient: a URL doesn't have to be checked against all the patterns, just the ones for its domain. There is also a scope based on metadata key/values, which is useful if, for instance, some of your seeds are organised by collection, as well as a global scope which is tried for all URLs if nothing else matched.</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">The resource file looks like </span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<i><span style="font-family: inherit;"><div>
[</div>
<div>
{</div>
<div>
<span style="white-space: pre;"> </span>"scope": "GLOBAL",</div>
<div>
<span style="white-space: pre;"> </span>"patterns": [</div>
<div>
<span style="white-space: pre;"> </span>"DenyPathQuery \\.jpg"</div>
<div>
<span style="white-space: pre;"> </span>]</div>
<div>
<span style="white-space: pre;"> </span>},</div>
<div>
<span style="white-space: pre;"> </span>{</div>
<div>
<span style="white-space: pre;"> </span>"scope": "domain:stormcrawler.net",</div>
<div>
<span style="white-space: pre;"> </span>"patterns": [</div>
<div>
<span style="white-space: pre;"> </span>"AllowPath /digitalpebble/",</div>
<div>
<span style="white-space: pre;"> </span>"DenyPath .+"</div>
<div>
<span style="white-space: pre;"> </span>]</div>
<div>
<span style="white-space: pre;"> </span>},</div>
<div>
<span style="white-space: pre;"> </span>{</div>
<div>
<span style="white-space: pre;"> </span>"scope": "metadata:key=value",</div>
<div>
<span style="white-space: pre;"> </span>"patterns": [</div>
<div>
<span style="white-space: pre;"> </span>"DenyPath .+"</div>
<div>
<span style="white-space: pre;"> </span>]</div>
<div>
<span style="white-space: pre;"> </span>}</div>
<div>
]</div>
</span></i></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">where the <i>Query </i>suffix indicates that the pattern should be matched against the path + query element rather than just the path.</span></div>
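<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">To make the scoping concrete, here is a minimal sketch in Python of how such a rule file could be evaluated. It is not the Java code of FastURLFilter. It assumes that the first matching rule within a scope decides, that scopes are tried in the order metadata, host name, domain, then GLOBAL, that a URL is kept when nothing matches, and that patterns only need to match somewhere in the path. The <i>host:</i> prefix and all function names are assumptions made for the illustration; only the rule syntax comes from the resource file above.</span></div>

```python
import re
from urllib.parse import urlsplit

# Same rules as the resource file above, as Python data.
RULES = [
    {"scope": "GLOBAL", "patterns": ["DenyPathQuery \\.jpg"]},
    {"scope": "domain:stormcrawler.net",
     "patterns": ["AllowPath /digitalpebble/", "DenyPath .+"]},
    {"scope": "metadata:key=value", "patterns": ["DenyPath .+"]},
]


def parse_rule(line):
    """Split 'AllowPath <regex>' into (allow, with_query, compiled regex)."""
    kind, regex = line.split(" ", 1)
    return kind.startswith("Allow"), kind.endswith("Query"), re.compile(regex)


def index_rules(rules):
    """Group the parsed rules by scope so that only relevant ones are checked."""
    return {r["scope"]: [parse_rule(p) for p in r["patterns"]] for r in rules}


def candidate_scopes(host, metadata):
    """Yield scopes from most to least specific (the order is an assumption)."""
    for key, value in metadata.items():
        yield f"metadata:{key}={value}"
    yield f"host:{host}"  # hypothetical prefix for full host names
    labels = host.split(".")
    for i in range(len(labels) - 1):  # www.a.b -> www.a.b, a.b
        yield "domain:" + ".".join(labels[i:])
    yield "GLOBAL"


def keep_url(url, index, metadata=None):
    """Return True if the URL should be kept, False if it is filtered out."""
    parts = urlsplit(url)
    path = parts.path or "/"
    path_query = path + ("?" + parts.query if parts.query else "")
    for scope in candidate_scopes(parts.hostname or "", metadata or {}):
        for allow, with_query, regex in index.get(scope, []):
            if regex.search(path_query if with_query else path):
                return allow  # first matching rule decides
    return True  # nothing matched: keep the URL


INDEX = index_rules(RULES)
```

<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">With these rules, <i>keep_url("http://stormcrawler.net/other/page.html", INDEX)</i> only looks at the two patterns registered for stormcrawler.net, which is where the efficiency gain comes from.</span></div>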
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">I hope you like this new release of StormCrawler and the new features it brings. I would like to thank all the users and contributors and particularly the <a href="http://www.gov.nt.ca/" target="_blank">Government</a></span><span style="font-family: Arial, Helvetica, sans-serif;"><a href="http://www.gov.nt.ca/" target="_blank"> of Northwest Territories</a></span><span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;"> in Canada who kindly donated some of the code of the CollectionTagger.</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">Happy Crawling!</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"> </span></div>
<div>
<br /></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-19418991258927685612018-05-25T16:30:00.002+01:002018-05-25T16:31:16.624+01:00What's new in StormCrawler 1.9<div dir="ltr" style="text-align: left;" trbidi="on">
<b><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></b> <b><span style="font-family: "arial" , "helvetica" , sans-serif;">Dependency upgrades</span></b><br />
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-f9e3a1d3-97c7-b05e-621d-6eb6d8e57522"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">OKHttp 3.10.0 </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/546" target="_blank">#546</a></span></li>
<li><span id="docs-internal-guid-df4e35eb-97c9-25c0-7028-f14f89806418"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">JSoup 1.11.2 </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/552" style="font-family: arial, helvetica, sans-serif;" target="_blank">#552</a></li>
<li><span id="docs-internal-guid-7647feba-97ca-57f2-f801-08fb453090f8"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">icu4j 61.1 </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/556" style="font-family: arial, helvetica, sans-serif;" target="_blank">#556</a></li>
<li><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Rometools 1.9.0 </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/556" style="font-family: arial, helvetica, sans-serif;" target="_blank">#556</a></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">HTTPClient 4.5.5 <a href="https://github.com/DigitalPebble/storm-crawler/issues/558" target="_blank">#558</a></span></li>
<li><span id="docs-internal-guid-d3d78b72-97cd-c217-9b6c-3be1039b7588"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Tika 1.18 </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/566" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#566</a></li>
</ul>
<b style="font-family: arial, helvetica, sans-serif;">Core</b><br />
<div>
<ul style="text-align: left;">
<li><span id="docs-internal-guid-b81adc62-97d0-7cbd-9c3f-9f95306f61d9"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Crawl-delay in robots.txt should optionally not shrink the configured delay </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/549" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#549</a></li>
<li><span id="docs-internal-guid-4e292ef4-97d1-abb4-eb36-fe566d257e78"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Optimisation: faster extraction of META tags </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/553" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#553</a></li>
<li><span id="docs-internal-guid-9ce5858d-97d4-50e5-9b22-7fc2af3436a5"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">CollectionMetric synchronized access to List </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/555" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#555</a></li>
<li><span id="docs-internal-guid-1233e842-97d5-1abc-1810-f4168457424b"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Configurable Robots Caches </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/557" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#557</a></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-5eb7fa4a-97dc-22f8-2c67-8f739e50e1fa"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">JSOUPParserBolt: lazy DOM conversion </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/563" target="_blank">#563</a></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Purge internal queues of tuples which have already reached timeout <a href="https://github.com/DigitalPebble/storm-crawler/issues/564" target="_blank">#564</a></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-13f39c0e-97e5-664d-1c07-0f98bc85c4bf"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Added ParseFilter to convert single valued Metadata to multi-valued ones </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/571" target="_blank">#571</a></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Caching of redirected robots.txt may overwrite correct robots.txt rules, fixes<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><a href="https://github.com/DigitalPebble/storm-crawler/issues/573" target="_blank">#573</a></span></li>
</ul>
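<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As an illustration of the Crawl-delay change (#549), the sketch below shows one way the politeness delay could be chosen. It is written in Python and is not the StormCrawler code; the function and parameter names are hypothetical, and the behaviour when robots.txt asks for a longer delay is an assumption.</span></div>

```python
from typing import Optional


def effective_delay(configured: float, robots: Optional[float],
                    allow_shrink: bool) -> float:
    """Pick the delay in seconds between two requests to the same host.

    configured   -- delay set in the crawler configuration
    robots       -- Crawl-delay found in robots.txt, or None if absent
    allow_shrink -- whether robots.txt may reduce the configured delay
    """
    if robots is None:
        return configured
    if robots < configured and not allow_shrink:
        # robots.txt asks for less than we are configured for: ignore it
        return configured
    return robots
```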
<div>
<b style="font-family: arial, helvetica, sans-serif;">WARC</b></div>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b><span id="docs-internal-guid-1c5ea07c-97d9-48cb-a191-87857c1d849e" style="font-weight: normal;"><span style="font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">WARCBolt to handle incorrect URIs gracefully </span></span></b></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/560" style="font-family: Arial, Helvetica, sans-serif;" target="_blank">#560</a></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">WARCRecordFormat use ByteBuffer instead of ByteArrayOutputStream <a href="https://github.com/DigitalPebble/storm-crawler/issues/561" target="_blank">#561</a></span></li>
</ul>
<div>
<b style="font-family: arial, helvetica, sans-serif;">Archetype</b></div>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-4eba0bdc-97de-8ea8-209c-b2a6320e0735"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Uses flux-core 1.2.1 </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/559" target="_blank">#559</a></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-229231bf-97df-91f0-1b05-7e47b738ad7a"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Added FeedParser to archetype topology </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/551" target="_blank">#551</a></span></li>
<li><span id="docs-internal-guid-fdada4ae-97e0-9a77-1a42-437217a818e1"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Added .kml and .wmv to url filters</span></span></span></li>
</ul>
<div>
<b style="font-family: arial, helvetica, sans-serif;">SOLR</b></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">MetricsConsumer handles recursive values <a href="https://github.com/DigitalPebble/storm-crawler/issues/554" style="background-color: white;" target="_blank">#554</a></span></li>
</ul>
<div>
<div>
<b style="font-family: arial, helvetica, sans-serif;">Elasticsearch</b></div>
<div>
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">MetricsConsumer handles recursive values <a href="https://github.com/DigitalPebble/storm-crawler/issues/554" style="background-color: white;" target="_blank">#554</a></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ES Indexer and Deletion Bolts to get index name from constructor <a href="https://github.com/DigitalPebble/storm-crawler/issues/572" style="background-color: white;" target="_blank">#572</a></span></li>
</ul>
<div>
<div>
<b style="font-family: arial, helvetica, sans-serif;">LanguageID</b></div>
<div>
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span id="docs-internal-guid-83327159-97e8-6b0e-116f-0bf4f34ed15b"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Added option to LanguageID to skip if metadata already set </span></span><a href="https://github.com/DigitalPebble/storm-crawler/issues/570" style="background-color: white;" target="_blank">#570</a></span></li>
</ul>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As usual, we advise all users to move to this version as it fixes several bugs. Thanks to all contributors and users. Happy crawling!</span></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-64952224152951644232018-03-23T10:52:00.002+00:002018-03-23T10:52:55.524+00:00Grafana StormCrawler metrics v4<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">The <a href="https://grafana.com/dashboards/2363">Grafana dashboard for StormCrawler</a> is a good starting point for monitoring the behaviour of your <a href="http://stormcrawler.net/">StormCrawler</a> topology. This is typically used with Elasticsearch as a storage backend for the metrics generated by Storm but should work with any other Storm-compatible backend like Graphite or CloudWatch. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Some of the metrics are specific to the components from the Elasticsearch module (spout, status, indexer) but you can simply remove or modify them if you use e.g. SOLR (NOTE: there was a <a href="https://github.com/grafana/grafana/issues/4422">feature request in Grafana</a> to add SOLR as a datasource but to my knowledge, this is not yet available).</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">The latest version (4) brings the following changes.</span><br />
<br />
<ul style="text-align: left;">
<li><b><span style="font-family: Arial, Helvetica, sans-serif;">URLs waiting in queues </span></b></li>
</ul>
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">The recent 1.8 release of StormCrawler added <a href="https://github.com/DigitalPebble/storm-crawler/issues/535">a new metric for the FetcherBolt</a> which tracks the amount of time URLs spend in the internal queues. This has been added to the "URLs waiting in queues" panel alongside the average population of the queues.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Vll6rF9ji_Z8ggHcre9h4fjf8Mj4lywE6Cs6jM-wehbAuwajJ-tlbm03Zw0tvEXKO2Ic5KmJ0rLYCh1VmXNSdI3e7n-UPLYmFgSWxovcgnzJfiSYOK3EZXvhEexmcZbb5isBIJpUFY_c/s1600/waitingInQueue.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" data-original-height="640" data-original-width="1600" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Vll6rF9ji_Z8ggHcre9h4fjf8Mj4lywE6Cs6jM-wehbAuwajJ-tlbm03Zw0tvEXKO2Ic5KmJ0rLYCh1VmXNSdI3e7n-UPLYmFgSWxovcgnzJfiSYOK3EZXvhEexmcZbb5isBIJpUFY_c/s640/waitingInQueue.gif" width="640" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;">Average time spent in queues + average queues population</span></td></tr>
</tbody></table>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<ul>
<li><b><span style="font-family: Arial, Helvetica, sans-serif;">ES StatusUpdater</span></b></li>
</ul>
<span style="font-family: Arial, Helvetica, sans-serif;">Instead of tracking the number of bulk requests sent in the last minute, we now have a panel showing the evolution over time. This information is for the ES StatusUpdaterBolt only.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQb266HZeYVO9jCvjYWEJkKEQTgCqUFszKaiK8lJLzyQ8ApLUwSTkhBVcWDDOrkPmmU6BDofQnNacjhiaYGFUXNugyx8X6wJOBOW1BQI74YV5Hv26YqFN4NB5BQ9-VjJuqIYMysawAfDue/s1600/bulkSent.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" data-original-height="648" data-original-width="1600" height="258" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQb266HZeYVO9jCvjYWEJkKEQTgCqUFszKaiK8lJLzyQ8ApLUwSTkhBVcWDDOrkPmmU6BDofQnNacjhiaYGFUXNugyx8X6wJOBOW1BQI74YV5Hv26YqFN4NB5BQ9-VjJuqIYMysawAfDue/s640/bulkSent.gif" width="640" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;">ES status updater bulk requests</span></td></tr>
</tbody></table>
<ul>
<li><b><span style="font-family: Arial, Helvetica, sans-serif;">Acked in StatusBolt</span></b></li>
</ul>
<span style="font-family: Arial, Helvetica, sans-serif;">This is a brand new panel which is not specific to Elasticsearch but operates on any component with '<i>status</i>' for id and shows the number of tuples acked over time, broken down by source. </span><div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy_gq2b3yQL42uNUrBIB2ktFNbxgHaFI8_QNw6WmaQCqWRzLyygyzF6zG5RQf9KwKCO-b1x0S9BVYI9DH9CU79ldpCSLuUz8gmxS74gMmMtCBb1GRiWu1tLlWh0vb-pwpnqkM0vRTAgSe-/s1600/acked.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" data-original-height="325" data-original-width="1600" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy_gq2b3yQL42uNUrBIB2ktFNbxgHaFI8_QNw6WmaQCqWRzLyygyzF6zG5RQf9KwKCO-b1x0S9BVYI9DH9CU79ldpCSLuUz8gmxS74gMmMtCBb1GRiWu1tLlWh0vb-pwpnqkM0vRTAgSe-/s640/acked.gif" width="640" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Arial, Helvetica, sans-serif;">Tuples acked by StatusUpdater</span></td></tr>
</tbody></table>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">In the graph above, we can see a peak early in the crawl where most of the tuples acked came from the sitemap bolt. Please note that the values are stacked in this graph. Sitemap files are typically discovered early in a crawl and generate a large number of discovered URLs; this is not the case later on when most tuples come from the HTML parser.</span></div>
<div>
<ul>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><b>Robots panel</b></span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">We removed the robots panel as the number of HTTP requests to robots files is shown in the "<i>Fetcher: pages fetched</i>" panel anyway and after the initial few minutes of a crawl, the panel simply indicated that the robots files were mostly cached.</span></div>
<div>
<ul>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><b>ES Indexed </b></span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">This is a new panel showing the number of documents indexed into Elasticsearch as well as the documents filtered out during the indexing.</span></div>
</div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<br /></div>
</div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-17832657489673541372018-03-20T15:59:00.006+00:002018-03-20T15:59:45.468+00:00What's new in StormCrawler 1.8<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="font-family: Arial, Helvetica, sans-serif;">I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:</span><b><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></b></span><br />
<b><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></b>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Dependency updates</span></b><br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Storm 1.2.1 <a href="https://github.com/DigitalPebble/storm-crawler/issues/531">#531</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLR 7.2.1 <a href="https://github.com/DigitalPebble/storm-crawler/issues/528">#528</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Tika 1.17 <a href="https://github.com/DigitalPebble/storm-crawler/issues/518">#518</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><span style="font-family: Arial, Helvetica, sans-serif;">Elasticsearch 6.2.2 <a href="https://github.com/DigitalPebble/storm-crawler/pull/525">#525</a> and </span><a href="https://github.com/DigitalPebble/storm-crawler/pull/539">#539</a></span></li>
</ul>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Core</b></span></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Add option to send only N bytes of text to indexers <a href="https://github.com/DigitalPebble/storm-crawler/issues/476">#476</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode <a href="https://github.com/DigitalPebble/storm-crawler/pull/522">#522</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">MemorySpout to generate tuples with DISCOVERED status <a href="https://github.com/DigitalPebble/storm-crawler/issues/529">#529</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">OKHttp configure type of proxy <a href="https://github.com/DigitalPebble/storm-crawler/issues/530">#530</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><i>http.content.limit</i> inconsistent default to -1 <a href="https://github.com/DigitalPebble/storm-crawler/pull/534">#534</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Track time spent in the FetcherBolt queues <a href="https://github.com/DigitalPebble/storm-crawler/issues/535">#535</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Increase <i>detect.charset.maxlength</i> default value <a href="https://github.com/DigitalPebble/storm-crawler/issues/537">#537</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">FeedParserBolt: metadata added by parse filters not passed forward in topology <a href="https://github.com/DigitalPebble/storm-crawler/issues/541">#541</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Use UTF-8 for input encoding of seeds (FileSpout) <a href="https://github.com/DigitalPebble/storm-crawler/pull/542">#542</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Default URL filter: exclude localhost and private address spaces <a href="https://github.com/DigitalPebble/storm-crawler/pull/543">#543</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">URLStreamGrouping returns the taskIDs and not their index <a href="https://github.com/DigitalPebble/storm-crawler/issues/547">#547</a></span></li>
</ul>
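<div>
<span style="font-family: Arial, Helvetica, sans-serif;">To show what the IDN conversion in #522 amounts to, here is a small Python sketch (not the BasicURLNormalizer code). It relies on Python's built-in <i>idna</i> codec; the function name is hypothetical and, for brevity, any user:password part of the URL is dropped.</span></div>

```python
from urllib.parse import urlsplit, urlunsplit


def host_to_ascii(url):
    """Rewrite the host name of a URL in its ASCII/Punycode form."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return url
    # The built-in 'idna' codec converts each label of the host name
    ascii_host = host.encode("idna").decode("ascii")
    # Rebuild the network location; userinfo is ignored in this sketch
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

<div>
<span style="font-family: Arial, Helvetica, sans-serif;">URLs whose host name is already plain ASCII come out unchanged, so the conversion is safe to apply to every URL.</span></div>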
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">WARC</span></b></div>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes <a href="https://github.com/DigitalPebble/storm-crawler/pull/520">#520</a></span></li>
</ul>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">SOLR</span></b></div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Schema for status index needs date type for nextFetchDate <a href="https://github.com/DigitalPebble/storm-crawler/issues/544">#544</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">SOLR indexer: use field type text for content field <a href="https://github.com/DigitalPebble/storm-crawler/issues/545">#545</a></span></li>
</ul>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif;">Elasticsearch</span></b></div>
</div>
<div>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">AggregationSpout fails with default value of es.status.bucket.field == _routing <a href="https://github.com/DigitalPebble/storm-crawler/issues/521">#521</a></span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Move to Elasticsearch REST API <a href="https://github.com/DigitalPebble/storm-crawler/pull/539">#539</a></span></li>
</ul>
</div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">We recommend that all users move to this version as it fixes several bugs (<a href="https://github.com/DigitalPebble/storm-crawler/issues/541">#541</a>, <a href="https://github.com/DigitalPebble/storm-crawler/issues/547">#547</a>) and adds some great new features. In particular, the Elasticsearch module now uses the REST API, which makes it future-proof and easier to configure. Other highlights are <a href="https://github.com/DigitalPebble/storm-crawler/issues/535">#535</a> and <a href="https://github.com/DigitalPebble/storm-crawler/pull/543">#543</a>.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
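<div>
<span style="font-family: Arial, Helvetica, sans-serif;">For readers wondering what excluding localhost and private address spaces (#543) means in practice, the Python sketch below flags such URLs using the standard <i>ipaddress</i> module. It is only an illustration: the default URL filter may well implement the check differently, the function name is hypothetical, and host names are not resolved.</span></div>

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_or_private(url):
    """Return True if the URL targets localhost or a non-public IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A regular host name: this sketch does not resolve it
        return False
    return addr.is_loopback or addr.is_private or addr.is_link_local
```

<div>
<span style="font-family: Arial, Helvetica, sans-serif;">A crawler would drop any outlink for which this returns true, which prevents pages on the open web from steering it towards services on its own network.</span></div>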
</div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">As usual, thanks to all contributors and users. Happy crawling!</span></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-7289276537889495712017-11-28T12:53:00.000+00:002017-11-28T12:53:42.458+00:00What's new in StormCrawler 1.7<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
Amazingly, this is the 20th release of <a href="http://stormcrawler.net/">StormCrawler</a>! Here are the main changes:<br />
<br />
<b>Dependencies updates</b><br />
<ul style="text-align: left;">
<li>crawler-commons 0.9 <a href="https://github.com/DigitalPebble/storm-crawler/issues/513">#513</a></li>
</ul>
<b>Core</b><br />
<ul style="text-align: left;">
<li>(bugfix) ParserBolts should use outlinks from parsefilters <a href="https://github.com/DigitalPebble/storm-crawler/issues/498">#498</a></li>
<li>LD_JSON parsefilter <a href="https://github.com/DigitalPebble/storm-crawler/pull/501">#501</a></li>
<li>okhttp : store request and response headers verbatim in metadata <a href="https://github.com/DigitalPebble/storm-crawler/issues/506">#506</a></li>
<li>(bugfix) okhttp protocol does not store headers in metadata <a href="https://github.com/DigitalPebble/storm-crawler/issues/507">#507</a></li>
<li>HTTP clients should handle http.accept.language and http.accept <a href="https://github.com/DigitalPebble/storm-crawler/issues/499">#499</a></li>
<li>Selenium protocol follows redirections <a href="https://github.com/DigitalPebble/storm-crawler/issues/514">#514</a></li>
<li>RemoteDriverProtocol needs multiple instances <a href="https://github.com/DigitalPebble/storm-crawler/issues/505">#505</a></li>
<li>SitemapParserBolt should force mime-type based on the clue <a href="https://github.com/DigitalPebble/storm-crawler/issues/515">#515</a></li>
</ul>
<div>
<b>Elasticsearch</b></div>
<div>
<ul style="text-align: left;">
<li>ES Spout : define filter query via config <a href="https://github.com/DigitalPebble/storm-crawler/issues/502">#502</a></li>
<li>Upgrade to ES 6.0 <a href="https://github.com/DigitalPebble/storm-crawler/pull/517">#517</a></li>
</ul>
<div>
We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the stormcrawler elasticsearch module while upgrading stormcrawler core.</div>
</div>
<div>
<br /></div>
<div>
This version improves the processing of sitemaps, via <a href="https://github.com/DigitalPebble/storm-crawler/issues/515">#515</a> and the use of crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also improved our <a href="http://square.github.io/okhttp/">okhttp</a>-based protocol implementation. If your crawl is a wide one with potentially any sort of content, then you should go for okhttp over the default httpclient one. See our comparison of protocol implementations on <a href="https://github.com/DigitalPebble/storm-crawler/wiki/Protocols">the wiki</a>.</div>
<div>
<br /></div>
<div>
Finally, if you want to extract semantic data represented in ld-json then you'll love <a href="https://github.com/DigitalPebble/storm-crawler/pull/501">#501</a>.</div>
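<div>
<br /></div>
<div>
To give an idea of what extracting ld-json involves, here is a minimal Python sketch built on the standard library only. It is not the parse filter added in #501 and says nothing about how that filter is configured; the class and function names are hypothetical. It simply collects the content of every script element whose type is application/ld+json and parses it as JSON.</div>

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the JSON-LD blocks embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        mime = dict(attrs).get("type") or ""
        if tag == "script" and mime.lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html):
    """Return the list of JSON-LD objects found in the HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

<div>
Each returned object is a plain dictionary, so values such as the headline or the publication date of an article can then be copied into the metadata of the document.</div>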
<div>
<br /></div>
<div>
As usual, thanks to all contributors and users. Happy crawling!</div>
<div>
<br /></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-3196799874329749822017-09-08T14:10:00.002+01:002017-11-28T11:25:42.838+00:00What's new in StormCrawler 1.6<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<b>Dependencies updates</b><br />
<br />
<ul style="text-align: left;">
<li>jsoup 1.10.3</li>
<li>crawler-commons 0.8</li>
</ul>
<br />
<b>Core</b><br />
<br />
<ul style="text-align: left;">
<li>Use ISO representation of time for modifiedtime in adaptivescheduler <a href="https://github.com/DigitalPebble/storm-crawler/issues/496">#496</a></li>
<li>Use ISO representation of time for discoveryDate and lastProcessedDate, <a href="https://github.com/DigitalPebble/storm-crawler/issues/477">#477</a></li>
<li>Improved Charset Detection <a href="https://github.com/DigitalPebble/storm-crawler/issues/495">#495</a></li>
<li>SitemapParserBolt: option to use SAX parsing or not</li>
<li>SitemapParserBolt generates metrics for average processing time</li>
<li>HTTP protocol based on OKHTTP <a href="https://github.com/DigitalPebble/storm-crawler/issues/484">#484</a> </li>
<li>Apache Http client can use HEAD method on a per URL basis <a href="https://github.com/DigitalPebble/storm-crawler/issues/485">#485</a></li>
<li>ContentFilter to leave trace of the pattern that matched <a href="https://github.com/DigitalPebble/storm-crawler/issues/480">#480</a></li>
<li>Metadata has a new public method for getting first non-empty value from a set of keys</li>
<li>Added ARTICLE to patterns for content filter</li>
</ul>
<br />
<b>LangID</b><br />
<br />
<ul style="text-align: left;">
<li>Can add more than one lang code based on configurable prob threshold. <a href="https://github.com/DigitalPebble/storm-crawler/issues/481">#481</a></li>
</ul>
<br />
<b>WARC</b><br />
<br />
<ul style="text-align: left;">
<li> Added rotation policy based on time and filesize</li>
</ul>
<br />
<b>ES</b><br />
<br />
<ul style="text-align: left;">
<li>ES: added <i>es.status.reset.fetchdate.after</i> <a href="https://github.com/DigitalPebble/storm-crawler/issues/478">#478</a></li>
<li>Removed Grafana resources - can be downloaded from <a href="https://grafana.com/dashboards/2363">Grafana portal</a></li>
</ul>
<br />
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-86490102967252526512017-05-29T14:17:00.000+01:002017-05-29T14:17:15.255+01:00What's new in StormCrawler 1.5<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial";"><a href="http://stormcrawler.net/">StormCrawler</a> 1.5 has just been released! It is an important milestone with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!</span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial";"><br /></span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial";">The project is in very good health overall, with more and more organisations using it in production and increased visibility, reflected by the growing number of questions on <a href="https://stackoverflow.com/questions/tagged/stormcrawler">StackOverflow</a>.</span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial";"><br /></span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "arial";">Here are the main changes in 1.5.</span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">CORE DEPENDENCIES UPGRADES</span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Apache Storm 1.1.0 (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/450" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#450</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">CORE MODULE</span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">HTTP Protocol: implement cookie handling (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/32" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#32</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">java.util.zip.ZipException: Not in GZIP format thrown on redirects with HttpClient (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/455" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#455</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Selenium-based protocol implementation (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/144" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#144</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">) which I described </span><a href="http://digitalpebble.blogspot.co.uk/2017/04/crawl-dynamic-content-with-selenium-and.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">in a separate blog post</span></a></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Indicate whether RobotsRules come from cache or have been fetched (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/460" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#460</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/462" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#462</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">FetcherBolt to dump URLs being fetched to log (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/464" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#464</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Override sitemapsAutoDiscovery settings per URL (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/469" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#469</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
</ul>
<div>
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><br /></span></span></div>
<div>
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;">Knowing whether RobotsRules come from the cache gives us more insight into the behaviour of the crawlers, as we can display the ratio of cached vs live rules (see illustration below)</span></span></div>
<div>
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYRuDH-PoGa3l8YC5Qic-MkPUzjjZQHYvxwlRUQxhyjLrmgpu11ddwSYAM-F9utn0B-nNK0lft9FFZzoGGFO9EtOiGH6caHrH2RhceXkyz2x85Wd3rdIqekw3mqr3mklkE6VkC1QFcD41N/s1600/robots.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="245" data-original-width="945" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYRuDH-PoGa3l8YC5Qic-MkPUzjjZQHYvxwlRUQxhyjLrmgpu11ddwSYAM-F9utn0B-nNK0lft9FFZzoGGFO9EtOiGH6caHrH2RhceXkyz2x85Wd3rdIqekw3mqr3mklkE6VkC1QFcD41N/s640/robots.png" width="640" /></a></div>
<div>
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><br /></span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 14.6667px; white-space: pre-wrap;">as well as pages fetched vs robots fetched.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 14.6667px; white-space: pre-wrap;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ5mzHeP5j8w35ROP2X9b5pRe6YslVZf1uEtzBNbwEx1l6789Ejy4Om8qaQoyi4EkHPGsEeOMtB3jN-dgj4pwDKXdJ9zpYHxfRH5JNGK8txvem6wT_d4TivcsyYa3ISf7_modm-Kc-u3JX/s1600/robots2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="352" data-original-width="815" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ5mzHeP5j8w35ROP2X9b5pRe6YslVZf1uEtzBNbwEx1l6789Ejy4Om8qaQoyi4EkHPGsEeOMtB3jN-dgj4pwDKXdJ9zpYHxfRH5JNGK8txvem6wT_d4TivcsyYa3ISf7_modm-Kc-u3JX/s640/robots2.png" width="640" /></a></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 14.6667px; white-space: pre-wrap;"> </span> </div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">ELASTICSEARCH</span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Utility class to export URL and metadata from ES index to file (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/444" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#444</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Fixed sampling with aggregation spout in ES5 </span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Upgrade to Elasticsearch 5.3 (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/221" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#221</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and </span><a href="https://github.com/DigitalPebble/storm-crawler/pull/451" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#451</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Optimise </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">nextFetchDate</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> to speed up queries to Elasticsearch (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/429" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#429</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and </span><a href="https://github.com/DigitalPebble/storm-crawler/pull/452" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#452</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Delete gone pages from index (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/253" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#253</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Metrics: remove filtering (</span><a href="https://github.com/DigitalPebble/storm-crawler/issues/281" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">#281</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">)</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div style="text-align: justify;">
<span id="docs-internal-guid-d2adfada-53e0-8f23-a7fb-3c3b6350232f"><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new field collapsing feature in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (</span><a href="https://github.com/DigitalPebble/storm-crawler/commit/9fefac8" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">9fefac8</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">) and improved the efficiency of the spouts by getting them to process results in a separate thread (</span><a href="https://github.com/DigitalPebble/storm-crawler/commit/1b0fb42" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">1b0fb42</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">). Combined with the optimisation of nextFetchDate (see above) and the fix to the </span><a href="https://github.com/DigitalPebble/storm-crawler/commit/c5a04d8" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">sampling</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> in AggregationSpout, this makes the Elasticsearch module more efficient than ever.</span></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the <a href="http://digitalpebble.blogspot.co.uk/2017/05/avoid-common-pitfalls-when-upgrading.html">common pitfalls of upgrading an existing topology to Elasticsearch 5</a>.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div style="text-align: left;">
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><b>Coming next?</b></span></span></div>
<div style="text-align: left;">
<span style="font-family: "arial"; font-size: 14.6667px; text-align: justify; white-space: pre-wrap;">As usual, it is hard to predict what the next release will contain, as the project is driven by its community.</span></div>
<div style="text-align: left;">
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><b><br /></b></span></span></div>
<div style="text-align: left;">
<span style="font-family: "arial"; font-size: 14.6667px; text-align: justify; white-space: pre-wrap;">Having said that, I'd expect the Selenium-based protocol to improve as users start to use it. It is also likely that we'll move away from the Apache HttpClient library (<a href="https://github.com/DigitalPebble/storm-crawler/issues/443">#443</a>). As mentioned in the post about the <a href="http://digitalpebble.blogspot.co.uk/2017/03/whats-new-in-stormcrawler-14.html">previous release</a>, we'll probably upgrade </span><span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;">to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.</span></span></div>
<div style="text-align: left;">
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"><br /></span></span></div>
<div style="text-align: left;">
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;"></span></span></div>
<span style="font-family: "arial";"><span style="font-size: 14.6667px; white-space: pre-wrap;">In the meantime and as usual, thanks to all contributors and users and happy crawling!</span></span><br />
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-33482011309364947372017-05-15T14:59:00.002+01:002017-05-15T15:03:50.867+01:00Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch, but it comes with a few pitfalls when upgrading an existing application. Some of the changes are documented in the <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/README.md">README</a> but I will reiterate them here, just in case.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>LOG4J dependencies</b></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since <a href="https://issues.apache.org/jira/browse/STORM-2326">my patch</a> is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Maven Shade Configuration</b></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The pom file of your StormCrawler-based project needs modifying as well: you'll need to specify the Maven Shade configuration and include the following manifest entries:</span></div>
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; font-stretch: normal; line-height: 1.45; overflow: auto; padding: 16px; word-break: normal; word-wrap: normal;"><<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">manifestEntries</span>>
  <<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">Change</span>></<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">Change</span>>
  <<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">Build-Date</span>></<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">Build-Date</span>>
</<span class="pl-ent" style="box-sizing: border-box; color: #63a35c;">manifestEntries</span>></pre>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">See <a href="https://github.com/elastic/elasticsearch/issues/21627">https://github.com/elastic/elasticsearch/issues/21627</a>; this wasn't an issue with the previous versions of Elasticsearch.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b style="font-family: Arial, Helvetica, sans-serif;">Update es-conf.yaml</b></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">In particular, the value of </span><span style="background-color: white; color: #63a35c; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; white-space: pre;">es.status.bucket.field</span><span style="background-color: #eaffea; color: #63a35c; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; white-space: pre;"> </span><span style="background-color: #eaffea; text-align: start;"><span style="font-family: "arial" , "helvetica" , sans-serif;">used to be </span></span><span style="background-color: white; color: #183691; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; text-align: left; white-space: pre;">_routing</span><span style="font-family: "arial" , "helvetica" , sans-serif;">, which is an automatically generated field. However, this field is no longer available to the spouts. Instead, use the same value as </span><span style="text-align: left;"><span style="background-color: white; color: #63a35c; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; white-space: pre;">es.status.routing.</span><span style="background-color: white; color: #63a35c; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; white-space: pre;">fieldname</span><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white;">, e.g. </span></span></span><span style="background-color: #eaffea; color: #183691; font-family: , "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; text-align: left; white-space: pre;">metadata.hostname</span><span style="font-family: "arial" , "helvetica" , sans-serif;">. </span></div>
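<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As a minimal sketch, assuming you route on the hostname as in the example above, the two entries in es-conf.yaml would then hold the same value. The field name is only an illustration; use whatever your own routing field is set to.</span></div>
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; font-stretch: normal; line-height: 1.45; overflow: auto; padding: 16px; word-break: normal; word-wrap: normal;"># assumption: routing is done on the hostname; substitute your own field name
es.status.routing.fieldname: "metadata.hostname"
# same value as the routing field above (this used to be "_routing")
es.status.bucket.field: "metadata.hostname"</pre>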
<div style="text-align: justify;">
<span style="text-align: left;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white;"><br /></span></span></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Mapping</b></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">ES5 should be able to read your existing indices. However, if you create a new set of indices from scratch, make sure you use the latest <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/ES_IndexInit.sh">version of the script</a>.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">I hope this helps you upgrade successfully. I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Happy crawling</span></div>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
</div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0tag:blogger.com,1999:blog-6540289076858785139.post-74914633848202734242017-04-27T10:39:00.000+01:002017-11-07T11:55:38.647+00:00Crawl dynamic content with Selenium and StormCrawler<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Many websites rely on AJAX to provide smooth and reactive web applications and/or single page websites. While this works fine for humans using modern browsers, it is often challenging for robots, as they can’t interpret the JavaScript and usually rely on low-level HTTP protocol implementations to get the binary content. Even Google </span><a href="https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">announced only as recently as October 2015</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> that their crawlers can handle dynamic content, even though </span><a href="http://searchengineland.com/can-now-trust-google-crawl-ajax-sites-235267" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">tests have shown</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> that this is still far from being perfect.</span><br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Support for dynamic content is something that many users have asked for in StormCrawler and I am pleased to announce that we have </span><a href="https://github.com/DigitalPebble/storm-crawler/pull/457" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">recently committed code for this</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. The next release of StormCrawler (1.5) will contain a </span><a href="http://www.seleniumhq.org/projects/webdriver/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Selenium WebDriver</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">-based protocol implementation so let’s have a sneak preview of how to use it and what it can do for you.</span><br />
<span style="font-family: "arial"; font-size: 11pt; font-weight: 700; text-align: left; white-space: pre-wrap;"><br /></span> <span style="font-family: "arial"; font-size: 11pt; font-weight: 700; text-align: left; white-space: pre-wrap;">Prerequisites</span><br />
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">The instructions below are based on Linux commands. You will need to install Java 8 and Maven to compile StormCrawler as well as </span><a href="http://phantomjs.org/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">PhantomJS</span></a><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> (2.1.1 or above), which we will connect to via WebDriver. You might want to install Apache Storm, even though this is not a strict requirement as we’ll see below.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Until StormCrawler 1.5 is released, you will need to get the master branch, either with Git or by downloading the code from </span><a href="https://github.com/DigitalPebble/storm-crawler" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">https://github.com/DigitalPebble/storm-crawler</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. Once this is done, cd to storm-crawler and run `mvn clean install`. This should put the storm-crawler artefacts in your local Maven repository, ready to use for the next step. This won’t be needed once 1.5 is released and you will be able to get the artefacts straight from Maven Central.</span><br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Simple example</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Let’s first build a StormCrawler project using the Maven archetype:</span><br />
<br />
<div style="text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span style="font-style: italic; text-align: left; white-space: pre-wrap;">mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5 -DgroupId=com.digitalpebble.crawl -DartifactId=selenium-tutorial -Dversion=1.0-SNAPSHOT -Dpackage=com.digitalpebble.crawl</span></span></div>
</div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This will give you a basic set of resources and configuration for StormCrawler. Go to the selenium-tutorial directory and build the uber jar with `</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">mvn clean package</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">`. We are now ready to go with a simple example.</span><br />
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Edit the file </span><span style="color: black; font-family: "arial"; font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">crawler.flux </span><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">and set </span><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><a href="https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing" style="text-decoration-line: none;">https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing</a></span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> as the value of the constructorArgs in the spout config, as shown below:</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><img height="91" src="https://lh5.googleusercontent.com/8SFJv5FIguXTVDhUl8bu3l8z5K0Et1Rjm-YoIzjFuBHWBbQ3R9BuheQ4G1ZdHJk3N1jp3igR8hW2G0Xa5xzn9uqRbEJfbjD-xqv-LL9fb3JXVcQEWQBhzZkQzZMbQu9dApgKbF6Q" style="border: none; transform: rotate(0rad);" width="602" /></span></div>
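<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In text form, the spout section of crawler.flux would look roughly like the sketch below. Only the URL in constructorArgs comes from this tutorial; the id, className and parallelism values are assumptions about what the archetype generates, so keep whatever your own file contains for those entries.</span></div>
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; font-stretch: normal; line-height: 1.45; overflow: auto; padding: 16px; word-break: normal; word-wrap: normal;">spouts:
  # id, className and parallelism are assumed archetype defaults; keep your own values
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      # the single seed URL used in this tutorial
      - ["https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing"]</pre>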
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">If you look at the source of that page, you’ll see that it consists mostly of JavaScript. Fine for our browsers, but how does StormCrawler fare on it? With Storm installed and accessible on the command line, let’s run:</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This will start the topology defined in the Flux file and let it run for one minute.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Note</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">: the command above assumes that you have installed Storm. Alternatively, you can run the code directly with Maven like so:</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">mvn clean compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="--local crawler.flux --sleep 60000"</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The console will display a lot of logs about the components being initialised, but also the status of the URLs (e.g. FETCHED, DISCOVERED), the fields extracted from the documents fetched and various metrics. To remove the metrics, you can comment out the section </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">topology.metrics.consumer.register </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">in crawler-conf.yaml.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Tip</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">: if you are feeling adventurous, have a look at the other entries from the conf files e.g. remove domain=domain from</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><i> indexer.md.mapping</i></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and see how that affects the output below.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Regardless of whether you ran the topology using Storm or Maven, you should see an output similar to this:</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">content</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">url</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">domain</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">dagbladet.no</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">description</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">title</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Dagbladet Mat</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">FETCHED</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Thu Apr 27 14:46:59 BST 2017</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The first 5 lines were generated by the </span><a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/indexing/StdOutIndexer.java" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">StdOutIndexer</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. As we can see, no text content was extracted at all, the title is a generic one and no other fields could be extracted. Further down, a</span><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> single line was generated by the </span><a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/persistence/StdOutStatusUpdater.java" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">StdOutStatusUpdater</span></a><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">, indicating that the URL was successfully fetched. However, no outlinks were discovered (otherwise we would have seen lines with a DISCOVERED status).</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Selenium to the rescue</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Time to put our brand new protocol implementation to use. Edit the file </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">crawler-conf.yaml </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">and add</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> selenium.addresses: "http://localhost:9515"</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This tells StormCrawler to use the custom protocol implementations and connect to a WebDriver server on port 9515. </span></div>
<b style="font-weight: normal;"><br /></b>
<br />
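<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="font-family: arial; font-size: 11pt; white-space: pre-wrap;">Since the crawl now depends on a WebDriver server answering at that address, it can help to check that something is listening there before launching the topology. The sketch below is not part of StormCrawler: the class WebDriverCheck and the method isListening are hypothetical names. It only tests that a TCP connection can be opened, not that the server actually speaks the WebDriver protocol.</span></div>

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

public class WebDriverCheck {

    // Returns true if a TCP connection to the host and port of the given
    // address can be opened within timeoutMs milliseconds.
    static boolean isListening(String address, int timeoutMs) {
        URI uri = URI.create(address);
        // An address without an explicit host or port cannot be checked.
        if (uri.getHost() == null || uri.getPort() == -1) {
            return false;
        }
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(uri.getHost(), uri.getPort()), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Same value as selenium.addresses in crawler-conf.yaml
        String address = "http://localhost:9515";
        System.out.println(address + " listening: " + isListening(address, 1000));
    }
}
```

<b style="font-weight: normal;"><br /></b>
<br />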
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Open a different console and run `</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">phantomjs --webdriver 9515` </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">then run the topology again and look at the output</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">content</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">2873 chars</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">url</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">keywords</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">mat,oppskrift,kokker,råvarer,ingredienser,bakt,potet,med,rømme-,og,blåmuggostdressing</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">domain</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">dagbladet.no</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">description</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">title</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Bakt potet med rømme- og blåmuggostdressing - Oppskrift | Dagbladet Mat</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This time we got some textual content and the correct title, and we were able to extract keywords. As you’ve certainly noticed, we also got all sorts of outlinks, similar to what we can observe with a browser.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">What happened under the bonnet is that PhantomJS gave us a fully interpreted HTML page, on which we ran our JSoup parser. The latter used the </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">ParseFilters</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> defined in </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">src/main/resources/parsefilters.json </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">to extract the metadata displayed by the indexer later on (i.e. title, description, domain, keywords, canonical).</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Let’s now look at a slightly more complex scenario.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">NavigationFilters</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Websites often use JavaScript for interactions within a page and for navigating through the content. If we look at </span><a href="https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> for instance, we can see that the pagination of the result lists is done in JavaScript. Assuming that we want to extract all the jobs listed on that board, the simple HTTP protocol implementation would give us the links from the initial page, but not the links to the following result pages, as these are loaded with AJAX.</span><br />
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span> <span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Luckily, we can implement the navigation logic by implementing a class extending </span><a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/protocol/selenium/NavigationFilter.java" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">NavigationFilter</span></a><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">. First, let’s create a new file </span><span style="color: black; font-family: "consolas"; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">JobBoardNavigationFilter.java </span><span style="color: black; font-family: "consolas"; vertical-align: baseline; white-space: pre-wrap;">in <i>src/main/java/com/digitalpebble/crawl</i> </span><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">and fill it with the content below</span></div>
<b style="font-weight: normal;"><br /></b> <script src="https://gist.github.com/jnioche/5f595e41867e236e27efb45a90c5062d.js"></script> <b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Tip: </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 9pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">wget "https://s.apache.org/mOkz" -O src/main/java/com/digitalpebble/crawl/JobBoardNavigationFilter.java</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The approach used here is to generate dummy HTML content and add links to all the job pages to it while iterating over the result pages. This class gets called by the Selenium-based protocol implementation.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
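<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="font-family: arial; font-size: 11pt; white-space: pre-wrap;">The dummy-page idea can be illustrated independently of Selenium. The sketch below is not the code from the gist above: DummyLinkPage and its methods are hypothetical names, and the URLs are placeholders. It takes the job URLs collected while paging through the results and wraps them in a minimal HTML document, so that a standard HTML parser would discover them as outlinks.</span></div>

```java
import java.util.Arrays;
import java.util.List;

public class DummyLinkPage {

    // Escape the characters that are unsafe inside an HTML attribute or text node.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build a minimal HTML document with one anchor per collected URL.
    static String build(List<String> urls) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (String url : urls) {
            String safe = escape(url);
            html.append("<a href=\"").append(safe).append("\">")
                .append(safe).append("</a>\n");
        }
        return html.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical URLs standing in for links gathered from each result page.
        List<String> jobs = Arrays.asList(
                "https://example.com/jobs/1",
                "https://example.com/jobs/2?ref=a&page=2");
        System.out.println(build(jobs));
    }
}
```

<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="font-family: arial; font-size: 11pt; white-space: pre-wrap;">In the scenario of this tutorial, the list would be filled by the navigation filter as it moves through the result pages, and the generated markup would stand in for the content of the page handed to the parser.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />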
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Now, let’s create a new file </span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">navigationfilters.json</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> in the directory resources and give it the following content</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">{</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> "com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters": [</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> {</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> "class": "com.digitalpebble.crawl.JobBoardNavigationFilter",</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> "name": "JobBoard"</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> }</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> ]</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">}</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Finally, we specify the name of the file we just created in the config with</span><br />
<span style="font-family: "consolas"; font-size: 9pt; font-style: italic; white-space: pre-wrap;"><br /></span> <span style="font-size: x-small;"><span style="font-family: "consolas"; font-style: italic; white-space: pre-wrap;">navigationfilters.config.file: navigationfilters.json</span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "consolas"; font-size: 9pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Don’t forget to recompile the code with `mvn clean package` before launching the crawl. This time we’ll just check that we get all the links to the job pages in one go.</span><br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: italic; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000 | grep DISCOVERED</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt; text-align: justify;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Note</span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">: why not download <a class="" href="https://sites.google.com/a/chromium.org/chromedriver/">chromedriver</a> and use it instead of PhantomJS? By default, chromedriver does not run in headless mode so you could see the browser being driven by the navigation filter, including the stuff you usually don’t notice, like the robots.txt file being fetched.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The resources covered here are only a first step towards making StormCrawler handle dynamic content, and there is still much work to do. However, the brand new protocol based on Selenium should already be a useful starting point. I hope you'll give it a try. Happy crawling!</span></div>
<br />
<br />
<br /></div>
Julien Niochehttp://www.blogger.com/profile/16499716503708780310noreply@blogger.com0