Web crawling is extremely useful for automating tasks you routinely perform on websites. You can write a crawler to interact with a website just as a human would.

In an earlier article, we covered the basics of writing a web crawler using the Python module Scrapy. The limitation of that approach is that the crawler does not support JavaScript, so it will not work properly with websites that make heavy use of JavaScript to manage the user interface. For such situations, you can write a crawler that drives Google Chrome and can therefore handle JavaScript just like a normal user-driven Chrome browser.

Automating Google Chrome involves a tool called Selenium. It is a software component that sits between your program and the browser, and helps you drive the browser from your program. In this article, we take you through the complete process of automating Google Chrome. The steps generally include:

  • Setting up Selenium
  • Using Google Chrome Inspector to identify sections of the webpage
  • Writing a Java program to automate Google Chrome

For the purpose of this article, let us investigate how to read Google Mail from Java. While Google does provide an API (Application Programming Interface) to read mail, here we use Selenium to interact with Google Mail to demonstrate the process. Google Mail makes heavy use of JavaScript and is thus a good candidate for learning Selenium.

Setting Up Selenium

Web Driver

As explained above, Selenium consists of a software component that runs as a separate process and performs actions on behalf of the Java program. This component is called the Web Driver and must be downloaded onto your computer.

Go to the ChromeDriver download site (linked from the Selenium downloads page), click on the latest release, and download the file appropriate for your OS (Windows, Linux, or macOS). It is a ZIP archive containing chromedriver.exe. Extract it to a suitable location such as C:\WebDrivers\chromedriver.exe. We will use this location later in the Java program.

Java Modules

The next step is to set up the Java modules required to use Selenium. Assuming you are using Maven to build the project, add the following dependency to your pom.xml.

        <dependencies>
            <dependency>
                <groupId>org.seleniumhq.selenium</groupId>
                <artifactId>selenium-java</artifactId>
                <version>3.8.1</version>
            </dependency>
        </dependencies>

When you run the build process, all the required modules should be downloaded and set up on your computer.
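The examples that follow assume imports along these lines (a representative set from the selenium-java package, not an exhaustive list):

        import java.util.List;

        import org.openqa.selenium.By;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.WebElement;
        import org.openqa.selenium.chrome.ChromeDriver;
        import org.openqa.selenium.support.ui.WebDriverWait;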

Selenium First Steps

Let us get started with Selenium. The first step is to tell Selenium where to find the chromedriver.exe we extracted earlier, and then to create a ChromeDriver instance:

        System.setProperty("webdriver.chrome.driver", "C:\\WebDrivers\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();

That should open a Google Chrome window. Let us navigate to the Google search page.

        driver.get("http://www.google.com");

Obtain a reference to the text input element so we can perform a search. The text input element has the name q. We locate HTML elements on the page using the method WebDriver.findElement().

        WebElement element = driver.findElement(By.name("q"));

You can send text to any element using the method sendKeys(). Let us send a search term and end it with a newline so the search begins immediately.

        element.sendKeys("terminator\n");

Now that a search is in progress, we need to wait for the results page. We can do that as follows:

        new WebDriverWait(driver, 10)
            .until(d -> d.getTitle().toLowerCase().startsWith("terminator"));

This code tells Selenium to wait up to 10 seconds, returning as soon as the page title starts with terminator (and throwing a TimeoutException if that never happens). We use a lambda expression to specify the condition to wait for.
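Instead of writing the lambda yourself, you can also use one of the ready-made conditions in Selenium's ExpectedConditions helper class (org.openqa.selenium.support.ui.ExpectedConditions); for example:

        new WebDriverWait(driver, 10)
            .until(ExpectedConditions.titleContains("terminator"));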

Now we can get the title of the page.

        System.out.println("Title: " + driver.getTitle());

Once you are done with the session, the browser window can be closed with:

        driver.quit();

And that, folks, is a simple browser session controlled from Java via Selenium. It seems quite simple, but it enables you to program a lot of things that you would normally have to do by hand.
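Putting the snippets above together, here is the whole session as one self-contained sketch. The class name GoogleSearch is just an example, and the chromedriver path is the location suggested earlier; adjust both for your setup.

        import org.openqa.selenium.By;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.WebElement;
        import org.openqa.selenium.chrome.ChromeDriver;
        import org.openqa.selenium.support.ui.WebDriverWait;

        public class GoogleSearch {
            public static void main(String[] args) {
                /* Point Selenium at the ChromeDriver executable downloaded earlier */
                System.setProperty("webdriver.chrome.driver", "C:\\WebDrivers\\chromedriver.exe");

                WebDriver driver = new ChromeDriver();
                try {
                    driver.get("http://www.google.com");

                    /* Type the search term; the trailing newline submits the search */
                    WebElement element = driver.findElement(By.name("q"));
                    element.sendKeys("terminator\n");

                    /* Wait up to 10 seconds for the results page */
                    new WebDriverWait(driver, 10)
                        .until(d -> d.getTitle().toLowerCase().startsWith("terminator"));

                    System.out.println("Title: " + driver.getTitle());
                } finally {
                    /* Always close the browser, even if something above threw */
                    driver.quit();
                }
            }
        }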

Using Google Chrome Inspector

Google Chrome Inspector is an invaluable tool for identifying the elements to be used with Selenium. It allows us to target the exact element from Java, whether for extracting information or for an interactive action such as clicking a button. Here is a primer on how to use the Inspector.

Open Google Chrome and navigate to a page, say the IMDb page for Justice League (2017).

Let us find the element we want to target, say the movie summary. Right-click on the summary and select "Inspect" from the popup menu.

From the "Elements" tab, we can see that the summary text is a div with a class of summary_text.

Using CSS or XPath for Selection

Selenium supports selecting elements from the page using CSS selectors (the dialect supported is CSS2). For example, to select the summary text from the IMDb page above, we would write:

        WebElement summaryEl = driver.findElement(By.cssSelector("div.summary_text"));

You can also use XPath to select elements in a very similar way (see the W3C XPath specification for details). Again, to select the summary text, we would do:

        WebElement summaryEl = driver.findElement(By.xpath("//div[@class='summary_text']"));

XPath and CSS selectors have similar capabilities, so you can use whichever you are more comfortable with.
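Either way, once you hold the WebElement, reading its contents is straightforward; for instance:

        System.out.println("Summary: " + summaryEl.getText());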

Reading Google Mail From Java

Let us now look into a more complex example: fetching Google Mail.

Start the Chrome Driver, navigate to gmail.com and wait until the page is loaded.

        WebDriver driver = new ChromeDriver();
        driver.get("https://gmail.com");
        new WebDriverWait(driver, 10)
            .until(d -> d.getTitle().toLowerCase().startsWith("gmail"));

Next, look for the email field (it has the id identifierId) and enter the email address, held here in a String variable named email. Click the Next button and wait for the password page to load.

        /* Type in username/email */
        {
            driver.findElement(By.cssSelector("#identifierId")).sendKeys(email);
            driver.findElement(By.cssSelector(".RveJvd")).click();
        }

        new WebDriverWait(driver, 10)
            .until(d -> !d.findElements(By.xpath("//div[@id='password']")).isEmpty());

Now, we enter the password (from a String variable named password), click the Next button again, and wait for the Gmail page to load.

        /* Type in password */
        {
            driver
                .findElement(By.xpath("//div[@id='password']//input[@type='password']"))
                .sendKeys(password);
            driver.findElement(By.cssSelector(".RveJvd")).click();
        }

        new WebDriverWait(driver, 10)
            .until(d -> !d.findElements(By.xpath("//div[@class='Cp']")).isEmpty());

Fetch the list of email rows and loop over each entry. The snippets in the following steps all go inside this loop body, where tr refers to the current row.

        List<WebElement> rows = driver
            .findElements(By.xpath("//div[@class='Cp']//table/tbody/tr"));
        for (WebElement tr : rows) {
            /* ... per-row snippets below go here ... */
        }

For each entry, fetch the From field. Note that some From entries could have multiple elements depending on the number of people in the conversation.

        {
            /* From Element */
            System.out.println("From: ");
            for (WebElement e : tr.findElements(By.xpath(".//div[@class='yW']/*"))) {
                System.out.println("    " +
                    e.getAttribute("email") + ", " +
                    e.getAttribute("name") + ", " +
                    e.getText());
            }
        }

Now, fetch the subject.

        {
            /* Subject */
            System.out.println("Sub: " + tr.findElement(By.xpath(".//div[@class='y6']")).getText());
        }

And the date and time of the message.

        {
            /* Date/Time */
            WebElement dt = tr.findElement(By.xpath("./td[8]/*"));
            System.out.println("Date: " + dt.getAttribute("title") + ", " +
                dt.getText());
        }

Finally, after the loop, print the total number of email rows on the page.

        System.out.println(rows.size() + " mails.");

And when we are done, we quit the browser.

        driver.quit();
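For reference, here is the whole walkthrough assembled into one self-contained sketch. The class name GmailReader and the way credentials are read from the command line are illustrative choices only, and Gmail's markup changes often, so the class names and XPaths reflect the page as it was when this article was written.

        import java.util.List;

        import org.openqa.selenium.By;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.WebElement;
        import org.openqa.selenium.chrome.ChromeDriver;
        import org.openqa.selenium.support.ui.WebDriverWait;

        public class GmailReader {
            public static void main(String[] args) {
                String email = args[0], password = args[1];  // illustrative; supply your own credentials

                System.setProperty("webdriver.chrome.driver", "C:\\WebDrivers\\chromedriver.exe");
                WebDriver driver = new ChromeDriver();
                try {
                    /* Open Gmail and wait for the login page */
                    driver.get("https://gmail.com");
                    new WebDriverWait(driver, 10)
                        .until(d -> d.getTitle().toLowerCase().startsWith("gmail"));

                    /* Type in username/email and click Next */
                    driver.findElement(By.cssSelector("#identifierId")).sendKeys(email);
                    driver.findElement(By.cssSelector(".RveJvd")).click();
                    new WebDriverWait(driver, 10)
                        .until(d -> !d.findElements(By.xpath("//div[@id='password']")).isEmpty());

                    /* Type in password and click Next */
                    driver.findElement(By.xpath("//div[@id='password']//input[@type='password']"))
                          .sendKeys(password);
                    driver.findElement(By.cssSelector(".RveJvd")).click();
                    new WebDriverWait(driver, 10)
                        .until(d -> !d.findElements(By.xpath("//div[@class='Cp']")).isEmpty());

                    /* Walk the list of email rows */
                    List<WebElement> rows = driver
                        .findElements(By.xpath("//div[@class='Cp']//table/tbody/tr"));
                    for (WebElement tr : rows) {
                        System.out.println("From: ");
                        for (WebElement e : tr.findElements(By.xpath(".//div[@class='yW']/*"))) {
                            System.out.println("    " + e.getAttribute("email") + ", "
                                + e.getAttribute("name") + ", " + e.getText());
                        }
                        System.out.println("Sub: "
                            + tr.findElement(By.xpath(".//div[@class='y6']")).getText());
                        WebElement dt = tr.findElement(By.xpath("./td[8]/*"));
                        System.out.println("Date: " + dt.getAttribute("title") + ", " + dt.getText());
                    }
                    System.out.println(rows.size() + " mails.");
                } finally {
                    driver.quit();
                }
            }
        }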

To recap, you can use Selenium with Google Chrome to crawl websites that use JavaScript heavily. And with the Google Chrome Inspector, it is quite easy to work out the CSS selector or XPath needed to extract information from, or interact with, an element.

Do you have any projects that would benefit from using Selenium? What issues are you facing with it? Please describe them in the comments below.