The Big Picture Of How Crawling Works

This post describes how crawler data is collected and organized. Some users have asked how the crawler data on the crawler-aware site is put together, so here we explain exactly how that data is collected and how it is organized.

We can run a reverse DNS (rDNS) lookup on the crawler's IP address. For example, for the IP 116.179.32.160, a reverse DNS lookup tool returns the hostname baiduspider-116-179-32-160.crawl.baidu.com.

From this we can tentatively conclude that the IP belongs to a Baidu search engine crawler. However, a hostname can be forged, so a reverse lookup alone is not accurate enough. We also need a forward lookup: using the ping command, we find that baiduspider-116-179-32-160.crawl.baidu.com resolves to 116.179.32.160. As the accompanying chart shows, the hostname resolves back to the same IP address, 116.179.32.160, so we can be confident that this is the Baidu search engine crawler.
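The reverse-then-forward check described above can be sketched in Python with the standard `socket` module. This is a minimal illustration, not the site's actual implementation: the function names are made up for this example, and the allowed hostname suffix is an assumption taken from the hostname shown above. The two lookups are passed in as parameters so they can be swapped out for testing.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, allowed_suffixes,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check (hypothetical helper).

    1. Reverse lookup: the IP must have a hostname.
    2. The hostname must end with a suffix the crawler operator uses.
    3. Forward lookup: the hostname must resolve back to the same IP.
    """
    hostname = reverse(ip)
    if hostname is None:
        return False
    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    return ip in forward(hostname)
```

A call such as `is_verified_crawler("116.179.32.160", (".crawl.baidu.com",))` performs live DNS queries, so its result depends on the network and on the DNS records at the time it runs.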

Searching by ASN-related information

Not all crawlers follow the rules above. For many crawlers a reverse lookup returns no result, so we need to query the ASN information for the IP address to judge whether the crawler's claimed identity is correct.

For example, take the IP 74.119.118.20. Querying the IP information shows that this address is located in Sunnyvale, California, USA.

The ASN information shows that it is an IP address belonging to Criteo Corp.

The screenshot above shows a log entry for the Criteo crawler. The part highlighted in yellow is the User-agent, followed by its IP address. Nothing is wrong with this request: the IP is indeed an IP address of CriteoBot.
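A rough sketch of the ASN check follows. Python has no built-in ASN lookup, so this example assumes the prefix-to-ASN data has already been obtained from some external source (a whois service or a downloaded routing table, for instance). Everything in `ASN_TABLE` is placeholder data: the prefixes come from the ranges reserved for documentation, the AS numbers come from the range reserved for documentation, and the organisation names are invented.

```python
import ipaddress

# Hypothetical prefix -> (ASN, organisation) table. Real data would come
# from an external ASN source; these entries are placeholders only.
ASN_TABLE = {
    "192.0.2.0/24": (64496, "Example Crawler Corp"),
    "198.51.100.0/24": (64497, "Example Hosting Ltd"),
}


def lookup_asn(ip, table=ASN_TABLE):
    """Return (asn, organisation) for the most specific matching prefix."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, info in table.items():
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, info)
    return best[1] if best else None


def matches_operator(ip, expected_org, table=ASN_TABLE):
    """True if the IP's ASN organisation contains the expected operator name."""
    info = lookup_asn(ip, table)
    return info is not None and expected_org.lower() in info[1].lower()
```

With real data, the same comparison would check that an IP claiming to be CriteoBot sits in a network registered to Criteo Corp.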

IP address segments published in the crawler's official documentation

Some crawlers publish their IP address segments. We save these officially published segments directly to the database, which is an easy and fast way to identify those crawlers.
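Checking an IP against stored segments could look like the following sketch, which uses Python's standard `ipaddress` module. The crawler names and CIDR ranges in `PUBLISHED_RANGES` are placeholders (documentation-reserved ranges), standing in for whatever a crawler operator actually publishes, and a plain dictionary stands in for the database.

```python
import ipaddress

# Placeholder data: in practice these would be the segments taken from
# each crawler's official documentation and stored in the database.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
    "OtherBot": ["198.51.100.0/25"],
}


def build_index(published):
    """Parse the CIDR strings once so lookups do not re-parse them."""
    return {
        name: [ipaddress.ip_network(cidr) for cidr in cidrs]
        for name, cidrs in published.items()
    }


def crawlers_for_ip(ip, index):
    """Return the names of all crawlers whose published ranges contain ip."""
    addr = ipaddress.ip_address(ip)
    return sorted(
        name
        for name, networks in index.items()
        if any(addr in net for net in networks)
    )
```

An IP that falls inside no published segment returns an empty list, which means only that this method cannot confirm it, not that the visitor is human.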

Via public logs

Public logs can often be found on the Internet. For example, the image accompanying this section is a public log record I found.

We can parse the log records and use the User-agent to determine which entries are crawlers and which are visitors, which greatly enriches our database of crawler records.
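As an illustration of this kind of parsing, the sketch below assumes the logs are in the common "combined" access log format and classifies each entry by looking for a few substrings in the User-agent. The log lines are invented for the example, and the hint list is a simplification chosen here, not the site's actual rule set. Since a User-agent can be forged just as a hostname can, a match is best treated as a candidate to confirm by IP.

```python
import re

# Combined log format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

# Assumed substrings that suggest a crawler; a real list would be longer.
CRAWLER_HINTS = ("bot", "spider", "crawl")

# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '116.179.32.160 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/118.0"',
]


def parse_line(line):
    """Return ip, user_agent and a crawler flag, or None if the line is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    user_agent = match.group("user_agent")
    is_crawler = any(hint in user_agent.lower() for hint in CRAWLER_HINTS)
    return {"ip": match.group("ip"), "user_agent": user_agent, "is_crawler": is_crawler}


records = [parse_line(line) for line in SAMPLE_LINES]
```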

Overview

The four methods above describe how the crawler identification website collects and organizes crawler data and how it ensures that the data is accurate and reliable. In practice other methods are used as well, but they are used less often, so they are not covered in this article.


