Automate Website Actions With Web Scraping & Rowy (2023): A Simple Guide

Olivia Rhye
11 Jan 2022
5 min read

Browser automation uses software to automate website actions, like retrieving information or filling out forms, which can save employees hours of repetitive work every week.

Say you browse a website like Craigslist and want to receive an email every time a new post matches your budget: website automation can do that.

With a tool like Rowy, it doesn't take much technical knowledge to do that either: in this article, we will explore how to use Rowy to automate actions on a website, including importing data with basic web scraping, parsing HTML data, pre-filling forms with OpenAI, and extracting articles with Rowy Actions.

1. Downloading Webpages With Fetch

Web scraping automates data retrieval from a website. There are two main ways to go about it: using basic regexes to parse the textual representation of a web page in HTML, or manipulating the Document Object Model (DOM) tree directly with a more complex tool like Puppeteer. Both require downloading the HTML first. In this article, we'll use the regex method to perform basic information retrieval tasks and pre-fill forms.

With Rowy, you can use basic HTTP fetch queries to import HTML from a URL into its spreadsheet interface. fetch is a built-in JavaScript function to make network requests, like retrieving the content of an HTML page:

const res = await fetch('https://www.rowy.io/blog')
const html = await res.text()
console.log(html)

We use fetch to make an HTTP request to a URL. In this case, the Response object will contain the Rowy blog webpage as an HTML string. We can then parse this HTML to retrieve the information we need, like the list of the latest blog posts.
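In practice, a request can fail or come back with an error page, and fetch only rejects on network failures, not on HTTP error statuses. Here is a minimal sketch of a more defensive version. The helper name `fetchHtml` and its `fetchImpl` parameter are hypothetical (not part of Rowy); the parameter is only there so the function can be tried with a stub instead of a real network call:

```javascript
// Hypothetical helper (not a Rowy API): download a page and return its HTML.
// `fetchImpl` defaults to the global fetch; pass a stub to try it offline.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  // fetch resolves even for 404 or 500 responses, so check the status too.
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

Calling `await fetchHtml('https://www.rowy.io/blog')` would then either return the HTML string or throw an error that names the failing URL.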

Sometimes, it can be hard to figure out which URL to call to get the HTML content. In that case, use your browser's dev tools to inspect the network requests made by a website. In Chrome, you can open the developer tools by pressing F12 or by right-clicking on a page and selecting "Inspect", then switch to the Network tab. You can then look for HTML responses, check their content in the 'Response' tab, and find the URL you need:

Chrome dev tools - 0.jpg

2. Importing Data In Rowy

After creating a new Rowy table, use an Action column to fetch a webpage and store the HTML output. An Action column is a special column that allows you to run your own JavaScript code, so we can copy / paste the code above into the Action column to import the HTML from a website:

Rowy Action column with a script to import HTML - 1.jpg

We use a second column named html to store the HTML output, and a third url column to input the URL we want to import.

This is the result with 3 rows of data after running the Action column:

Rowy table with 3 rows of data - 2.jpg
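The script behind the Action column can be sketched roughly as follows. This is an illustration under assumptions, not Rowy's exact template: it assumes the script receives the current `row` plus a `ref` object whose `update` method writes fields back to that row, and that it reports its outcome as an object with `success` and `message` fields. The function name `importHtml` and the `fetchImpl` parameter are hypothetical, so check the Action column editor for the exact signature before reusing it:

```javascript
// Hypothetical sketch of an Action script: read row.url, store the page in html.
// Assumptions: `row` holds the cell values and `ref.update` writes to the row.
async function importHtml({ row, ref }, fetchImpl = fetch) {
  if (!row.url) {
    return { success: false, message: "The url column is empty" };
  }
  try {
    const res = await fetchImpl(row.url);
    if (!res.ok) {
      return { success: false, message: `HTTP ${res.status} for ${row.url}` };
    }
    const html = await res.text();
    await ref.update({ html }); // write the HTML into the html column
    return { success: true, message: `Imported ${html.length} characters` };
  } catch (err) {
    return { success: false, message: `Fetch failed: ${err.message}` };
  }
}
```

Returning a failure message instead of throwing keeps one broken URL from hiding what went wrong on that row.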

Now that we have our raw HTML, we can use a regex to extract the parts we need.

3. Parsing HTML Data

We can use a derivative column to parse the HTML data. A derivative column is a special column that runs your own JavaScript code on each row, using the values of other columns, and stores the result. In this case, we will use a derivative column to extract the information we need, like text, images, or links.

Parsing HTML with JavaScript can be done using a regex. A regex (regular expression) is a pattern that describes a set of strings. For example, the regex /Rowy/g will match every instance of the word Rowy:

var htmlString = row.html
var formRegex = /Rowy/g;
// match() returns null when nothing matches, so fall back to an empty array
var formMatch = htmlString.match(formRegex) || [];
console.log(`Rowy is mentioned ${formMatch.length} times`)

We can then put this code in the derivative column to store the number of times the word Rowy is mentioned in the HTML:

Rowy column with occurrences of the word Rowy - 4.jpg
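To store the count, it helps to compute it as a return value rather than only logging it. Here is a minimal sketch, assuming the value returned by the derivative script is what ends up in the cell. The helper name `countMentions` is hypothetical. It also escapes regex metacharacters so that the search word is matched literally:

```javascript
// Hypothetical helper: count how many times `word` appears in `html`.
function countMentions(html, word) {
  if (!html || !word) return 0;
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const matches = html.match(new RegExp(escaped, "g"));
  // match() returns null when nothing is found.
  return matches ? matches.length : 0;
}
```

Inside a derivative column, the script would then return something like `countMentions(row.html, "Rowy")`. The search is case-sensitive, so "rowy" would not be counted.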

Writing regex can be tricky, but there are many tools online to help you write them. For example, you can use regex101 to test your regex and ChatGPT to write them.

Similarly, reading HTML can be difficult but you can use the browser's dev tools again to inspect the HTML of a page. If you want to extract the title of a Rowy article for example, you would use the dev tools to find the HTML branch that contains the title and use it to write your regex:

Chrome webpage inspector - 3.jpg

In this screenshot, we notice that all article titles are in an h1 tag, so we'll just need a regex to match the h1 tag:

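var htmlString = row.html
// No g flag: with g, match() returns whole matches and drops the capture group
var formRegex = /<h1[^>]*>([\s\S]*?)<\/h1>/;
var formMatch = htmlString.match(formRegex);
if (formMatch) {
    console.log(`Title: ${formMatch[1]}`)
}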
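
A page can contain several headings, and a title often wraps other tags such as links. To collect every title as plain text, `matchAll` is one option: unlike `match` with the g flag, it keeps the capture groups for each match. This is a minimal sketch, and the helper name `extractTitles` is hypothetical:

```javascript
// Hypothetical helper: return the text of every <h1> element in an HTML string.
function extractTitles(html) {
  // [^>]* skips attributes; [\s\S]*? also matches across line breaks.
  const h1Regex = /<h1\b[^>]*>([\s\S]*?)<\/h1>/gi;
  return Array.from(html.matchAll(h1Regex), (m) =>
    m[1]
      .replace(/<[^>]+>/g, "") // drop nested tags such as <a> or <span>
      .replace(/\s+/g, " ")    // collapse whitespace and line breaks
      .trim()
  );
}
```

For example, `extractTitles(row.html)` returns an array with one string per h1 element, or an empty array when the page has none.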

4. Example: Extracting Articles

Let's imagine a more useful example where you'd fetch articles from several websites to read them in a distraction-free interface like Rowy's. You'd put the article's link in the table and Rowy would take care of extracting the content. We can use the same technique:

// The HTML string containing the article
var htmlString = row.html

// Use a regular expression to match the first article element.
// [\s\S]*? also matches line breaks, which . does not.
var articleRegex = /<article[^>]*>([\s\S]*?)<\/article>/;
var articleMatch = htmlString.match(articleRegex);

if (articleMatch) {
    // Index 1 holds the capture group: the content between the tags
    var articleContent = articleMatch[1];
    console.log("Article Content: " + articleContent);
}

Well-designed websites encapsulate their HTML content in semantic elements like article, header, footer, etc. For this example, we look for the article element to extract the content of the article. We can then use a derivative column to store the article content as rich text, and the extracted content will automatically be displayed in the Rowy interface, completely clutter-free:

Rowy article as rich text cell - 5.jpg
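
Article markup often embeds scripts and styles that add nothing to a reading view. Here is a minimal sketch of a derivative-style helper that returns the article's inner HTML with those blocks removed, or null when the page has no article element. The name `extractArticle` is hypothetical:

```javascript
// Hypothetical helper: return the inner HTML of the first <article>, or null.
function extractArticle(html) {
  const match = html.match(/<article\b[^>]*>([\s\S]*?)<\/article>/i);
  if (!match) return null;
  return match[1]
    // Remove <script> and <style> blocks; \1 matches the same tag name on close.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    .trim();
}
```

Because the regex stops at the first closing tag, nested article elements would be cut short. For pages like that, a DOM-based tool such as Puppeteer is probably a better fit.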

Join Our Rowy Community

In conclusion, browser automation is an essential tool for businesses and individuals who frequently interact with web-based applications. With Rowy, you can automate repetitive tasks like importing data with basic web scraping, parsing HTML data, and extracting articles to save time and improve your productivity.

Check out our Discord Community to learn more about using Rowy for your use case and connect with other users.
