[Web Scraper] - Welltrack Boost

Problem

In this article, we will walk through building a web scraper to populate the Resources page of Vera with real deployment resources. We will create a POST endpoint that scrapes the provided URL and writes the extracted data into the Resources page. The longer-term goal is a set of page-specific scrapers that run periodically so the resource information stays up-to-date.
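
The article does not prescribe a scheduler for the periodic runs. As a rough sketch, one could trigger the scraping endpoint on a schedule with a library such as node-cron; the cron expression and local URL below are assumptions for illustration, not part of the project's code.

// Hypothetical scheduler sketch (node-cron is an assumption, not part of the article).
// Assumes the Express app runs locally and exposes POST /web-scraping.
const cron = require('node-cron');
const axios = require('axios');

// Run once a day at midnight; the cron expression is an arbitrary example.
cron.schedule('0 0 * * *', async () => {
  try {
    await axios.post('http://localhost:3000/web-scraping');
    console.log('Scraper run triggered');
  } catch (err) {
    console.error('Scraper run failed:', err.message);
  }
});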

The Idea Behind Web Scraping

Web scraping is a technique for extracting data from websites. In this case, we want a scraper that pulls information from a specific URL and populates the Resources page of Vera. Our POST endpoint will use the axios library to fetch the page's HTML (with a GET request under the hood) and the cheerio library to parse that HTML and pull out the fields we need.
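
As a minimal sketch of the cheerio pattern (the HTML here is a made-up example, not the real page):

const cheerio = require('cheerio');

// Load a small, made-up HTML snippet and query it with CSS selectors.
const $ = cheerio.load('<html><head><title>Demo</title></head><body><p>Hello</p></body></html>');

console.log($('title').text()); // "Demo"
console.log($('p').text());     // "Hello"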

Task

To complete this task, we need to follow these steps:

Step 1: Create a POST Request in backend/routes/web_scrapping.js

We need to add a POST route in backend/routes/web_scrapping.js that scrapes the provided URL. Since each web scraper is specific to its target page, we suggest keeping the entire scraping code inside that route handler.

const express = require('express');
const router = express.Router();
const axios = require('axios');
const cheerio = require('cheerio');
// Require the model file directly so the schema is registered before use.
const IndividualResources = require('../models/IndividualResources');

router.post('/web-scraping', async (req, res) => {
  try {
    // Fetch the page HTML and load it into cheerio for querying.
    const url = 'https://chw.calpoly.edu/counseling/apps';
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Selectors for elements that exist on most pages.
    const title = $('title').text();
    const imageUrl = $('img').first().attr('src');
    const imageAltText = $('img').first().attr('alt');
    const address = $('address').text();
    const paragraphText = $('p').text();

    // NOTE: 'building-name', 'phone-number', 'resource-url', and 'hours'
    // below are placeholder selectors, not real HTML tags; inspect the
    // page's actual markup and replace them with real CSS selectors.
    const buildingName = $('building-name').text();
    const phoneNumber = $('phone-number').text();
    const resourceUrl = $('resource-url').text();

    const lastUpdate = new Date().toISOString();
    const category = ['Self-Help'];
    const listOfHours = [];
    const extraInfo = ['https://chw.calpoly.edu/counseling/apps'];

    // Extract hours from the website (placeholder selector, see above).
    const hours = $('hours').text();
    listOfHours.push(hours);

    // Create a new resource
    const newResource = {
      Title: title,
      ImageURL: imageUrl,
      ImageAltText: imageAltText,
      Address: address,
      BuildingName: buildingName,
      ParagraphText: paragraphText,
      PhoneNumber: phoneNumber,
      ResourceURL: resourceUrl,
      LastUpdate: lastUpdate,
      Category: category,
      ListOfHours: listOfHours,
      ExtraInfo: extraInfo,
    };

    // Save the new resource to MongoDB using the model imported above.
    const newResourceDoc = new IndividualResources(newResource);
    await newResourceDoc.save();

    res.json({ message: 'New resource created successfully' });
  } catch (error) {
    console.error(error);
    res.status(500).json({ message: 'Error creating new resource' });
  }
});

module.exports = router;

Step 2: Find as Much Information as Possible for the IndResSchema

We need to fill in as many fields as possible of the IndResSchema defined in /backend/models/IndividualResources.js. The schema is defined with the mongoose library:

const mongoose = require('mongoose');

const individualResourcesSchema = new mongoose.Schema({
  Title: String,
  ImageURL: String,
  ImageAltText: String,
  Address: String,
  BuildingName: String,
  ParagraphText: String,
  PhoneNumber: String,
  ResourceURL: String,
  LastUpdate: Date,
  Category: [String],
  ListOfHours: [String],
  ExtraInfo: [String],
});

const IndividualResources = mongoose.model('IndividualResources', individualResourcesSchema);

module.exports = IndividualResources;

Step 3: Populate the Required Fields

We need to populate the required fields of the IndResSchema. The route handler from Step 1 already extracts each field with cheerio and assembles them into the newResource object, so there is no separate code for this step; the work here is replacing the placeholder selectors with real ones found by inspecting the page's markup. A sketch of what that might look like follows.

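The snippet below is a sketch of drop-in replacements for the placeholder lines inside the Step 1 route handler; the class names are hypothetical, not taken from the real page, so inspect the live markup and substitute the actual selectors before relying on them.

// Hypothetical selectors -- inspect the live page and substitute the
// real class names / element structure before using these.
const buildingName = $('.field-building-name').text().trim();
const phoneNumber = $('.field-phone a').first().text().trim();
const resourceUrl = $('.field-resource-link a').first().attr('href');
// cheerio's .map().get() returns a plain array of strings, which matches
// the ListOfHours: [String] field in the schema.
const listOfHours = $('.field-hours li')
  .map((i, el) => $(el).text().trim())
  .get();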

Step 4: Use Postman to Manually Make a POST Request

We need to use Postman to manually make a POST request to the /web-scraping endpoint. We will verify that the new resource appears in MongoDB.
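
For the endpoint to be reachable from Postman, the router has to be mounted in the Express app. A minimal sketch, assuming an entry point such as backend/server.js and a MongoDB connection string in an environment variable (both are assumptions, not taken from the article):

const express = require('express');
const mongoose = require('mongoose');
const webScrapingRouter = require('./routes/web_scrapping');

const app = express();
app.use(express.json());

// Mount the scraper routes; POST /web-scraping now triggers the scrape.
app.use('/', webScrapingRouter);

// Connect to MongoDB before accepting requests.
mongoose.connect(process.env.MONGODB_URI)
  .then(() => app.listen(3000, () => console.log('Listening on 3000')))
  .catch((err) => console.error('Mongo connection failed:', err));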

Step 5: Copy the New Individual Resource ObjectId and Add it to the "Dev-Resources" Document

We need to copy the new individual resource ObjectId and add it to the "Dev-Resources" document under "general-resource-category" in MongoDB. We will verify that the individual resource appears on the website.
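
In MongoDB Compass this can be done by hand. In the mongo shell it might look like the following sketch; the collection name, filter, and field path are assumptions based on the description above, so adjust them to match the real documents before running.

// mongosh sketch -- verify collection, filter, and field names first.
db.resources.updateOne(
  { Title: 'Dev-Resources' },
  { $push: { 'general-resource-category': ObjectId('PASTE_NEW_ID_HERE') } }
);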

Notes

We will use the router.get('/colleges-and-majors'... request in backend/routes/stories.js as a reference for scraping with axios and cheerio. Note that this particular route is a GET request, while ours is a POST; the other POST routes in the same file (stories.js) are a useful reference if you are unfamiliar with making POST requests in Express.

Frequently Asked Questions

This section answers some frequently asked questions about the Welltrack Boost web scraper.

Q: What is web scraping?

A: Web scraping is a technique used to extract data from websites. In this case, we are using web scraping to extract information from the Welltrack Boost website and populate the Resources page of Vera.

Q: Why do we need to create a web scraper?

A: We need to create a web scraper to populate the Resources page of Vera with real deployment resources. This will ensure that the resource information is up-to-date and accurate.

Q: What is the purpose of the web scraper?

A: The purpose of the web scraper is to extract information from the Welltrack Boost website and populate the Resources page of Vera. This will include extracting the title, image URL, image alt text, address, building name, paragraph text, phone number, resource URL, last update, category, list of hours, and extra information.

Q: How do we create a web scraper?

A: To create a web scraper, we use a library such as axios to fetch the page's HTML (via a GET request under the hood) and a library such as cheerio to parse that HTML and extract the required information. The scraper itself is then exposed as a POST endpoint that triggers the whole fetch-parse-save process.

Q: What is the difference between a GET and POST request?

A: A GET request retrieves data from a server, while a POST request sends data to a server or asks it to perform an action. In this case, we use a POST request to our own /web-scraping endpoint because calling it changes state (it creates a new resource in the database); internally, the scraper then uses a GET request to fetch the page it scrapes. A small illustration follows.
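
As a tiny illustration of the difference in Express (the /resources route path is an arbitrary example, not from the article):

const express = require('express');
const router = express.Router();
const IndividualResources = require('../models/IndividualResources');

// GET: retrieves data; read-only, no side effects expected.
router.get('/resources', async (req, res) => {
  res.json(await IndividualResources.find());
});

// POST: triggers a state change -- here, scraping a page and inserting
// a new document (the handler body is sketched in Step 1).
router.post('/web-scraping', async (req, res) => {
  // ...scrape and save, as shown in Step 1...
  res.json({ message: 'New resource created successfully' });
});

module.exports = router;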

Q: How do we verify that the new resource appears in MongoDB?

A: To verify that the new resource appears in MongoDB, use a tool such as MongoDB Compass to inspect the database. You can also make the POST request to the /web-scraping endpoint with Postman, check for the success message in the response, and then confirm that the new document exists in the collection, for example with a quick shell query as sketched below.
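
A minimal mongosh check, assuming mongoose's default collection naming (lowercased, pluralized model name); confirm the actual collection name in Compass first.

// mongosh sketch -- mongoose typically maps the IndividualResources
// model to an 'individualresources' collection.
db.individualresources.findOne({}, { Title: 1, LastUpdate: 1 });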

Q: How do we copy the new individual resource ObjectId and add it to the "Dev-Resources" document?

A: To copy the new individual resource ObjectId and add it to the "Dev-Resources" document, we need to use a tool such as MongoDB Compass to find the ObjectId of the new resource and add it to the "Dev-Resources" document.

Q: What is the purpose of the "Dev-Resources" document?

A: The purpose of the "Dev-Resources" document is to store the ObjectIds of the individual resources that have been created. This will allow us to easily access and manage the individual resources.

Q: How do we verify that the individual resource appears on the website?

A: To verify that the individual resource appears on the website, open the Resources page in a web browser and look for the new entry. If the frontend is backed by a GET endpoint that serves resources, you can also hit that endpoint with a tool such as Postman and confirm the new resource appears in the response.

Conclusion

In this article, we walked through building the Welltrack Boost web scraper and answered some frequently asked questions about it. We covered what web scraping is, how to create the scraper, how to verify that the new resource appears in MongoDB, how to copy the new individual resource ObjectId into the "Dev-Resources" document, and how to verify that the individual resource appears on the website.