
How to extract thanked people from a blog

Extract thanked people from a blog using Web Transpose Crawl and AI Web Scraper

4 Dec 2023 • 1 min read • < 1 min coding time

Mike Gee

Link to Code

Context

Why would you want to extract the thanked people from a blog?

You can derive interesting relationships between people by extracting the thanked people from blogs.

Here's a visualization of the data for Paul Graham: PG Number.
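One way to turn thank-yous into relationships is to treat each one as a link between the author and the person thanked, then count hops outward from a chosen person. The sketch below assumes a hop count in the spirit of an Erdős number; the source does not define how "PG Number" is computed, so this is only one possible reading. The names and the `hop_distances` helper are hypothetical.

```python
from collections import deque


def hop_distances(thanks, start):
    """Breadth-first search over an undirected 'thanks' graph.

    thanks maps an author to the list of people that author thanked.
    Returns a dict of person -> number of hops from start.
    """
    # Treat each thank-you as an undirected link (an assumption of this sketch).
    neighbours = {}
    for author, thanked in thanks.items():
        for person in thanked:
            neighbours.setdefault(author, set()).add(person)
            neighbours.setdefault(person, set()).add(author)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in distances:
                distances[nxt] = distances[current] + 1
                queue.append(nxt)
    return distances


# Made-up data for illustration only.
thanks = {
    'Author A': ['Person B', 'Person C'],
    'Person B': ['Person D'],
}
distances = hop_distances(thanks, 'Author A')
print(distances)
```

People who never appear in a chain of thank-yous leading back to the start are simply absent from the returned dict.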

Extracting Thanked People Using Web Transpose

Defining a Schema

First, we define a schema. It can include categories beyond just the people thanked, such as the reason each person was mentioned.

schema = {
    'people thanked': {
        'type': 'array',
        'items': {
            'person name': 'string',
            'reason mentioned': ['thanked for reading drafts of this blog / essay', 'other kind of praise', 'other reason']
        }
    }
}
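To make the schema concrete, here is a minimal sketch of a result that would fit it, along with a small check of that shape. It assumes the scraper returns a dict whose keys mirror the schema; the source does not confirm the exact return format. The example names and the `is_valid_result` helper are hypothetical.

```python
# The three allowed values for 'reason mentioned', copied from the schema.
ALLOWED_REASONS = [
    'thanked for reading drafts of this blog / essay',
    'other kind of praise',
    'other reason',
]


def is_valid_result(result):
    """Return True if result has the shape the schema describes."""
    people = result.get('people thanked')
    if not isinstance(people, list):
        return False
    for entry in people:
        if not isinstance(entry, dict):
            return False
        if not isinstance(entry.get('person name'), str):
            return False
        if entry.get('reason mentioned') not in ALLOWED_REASONS:
            return False
    return True


# Hypothetical output, made up for illustration only.
example_result = {
    'people thanked': [
        {'person name': 'Jane Doe', 'reason mentioned': ALLOWED_REASONS[0]},
        {'person name': 'John Roe', 'reason mentioned': 'other reason'},
    ]
}
print(is_valid_result(example_result))
```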

Extraction using Web Transpose AI Web Scraper

Normally, the AI Web Scraper generates a web scraper that can be reused on the same website, which keeps latency minimal. However, when the pages don't share a standard format, it just extracts the requested data directly.

Python

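import os
import webtranspose as webt

os.environ['WEB_TRANSPOSE_API_KEY'] = "YOUR WEBT API KEY"
url = 'https://my-blog-url.com'  # placeholder: the blog post to extract from

# `schema` is the dictionary defined above
scraper = webt.Scraper(schema, render_js=False)
print(scraper.scrape(url))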
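Once several posts have been scraped, the results can be tallied to see who is thanked most often. This is a rough sketch that assumes each scrape returns a dict keyed like the schema above, which the source does not confirm. The `count_thanked` helper and the sample data are hypothetical, and no API call is made here.

```python
from collections import Counter

DRAFTS = 'thanked for reading drafts of this blog / essay'


def count_thanked(results, reason=None):
    """Count how often each person appears across several scrape results.

    If reason is given, only entries with that 'reason mentioned' are counted.
    """
    counts = Counter()
    for result in results:
        for entry in result.get('people thanked', []):
            if reason is None or entry.get('reason mentioned') == reason:
                # Strip stray whitespace so the same name is not counted twice.
                counts[entry['person name'].strip()] += 1
    return counts


# Made-up results standing in for the output of two scraped posts.
sample_results = [
    {'people thanked': [
        {'person name': 'Jane Doe', 'reason mentioned': DRAFTS},
        {'person name': 'John Roe', 'reason mentioned': 'other reason'},
    ]},
    {'people thanked': [
        {'person name': 'Jane Doe ', 'reason mentioned': DRAFTS},
    ]},
]

print(count_thanked(sample_results).most_common())
print(count_thanked(sample_results, reason=DRAFTS).most_common())
```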

Tailwind / JS

A Tailwind & JS SDK is currently in development. You can use the API in the meantime.

// `schema` is assumed to be a JS object with the same shape as the Python schema above.
const headers = {'X-API-Key': 'YOUR_API_KEY', 'Content-Type': 'application/json'};

// Build AI Web Scraper
const createOptions = {
  method: 'POST',
  headers,
  // Serialize the real values; a hand-written string would not include `schema`.
  body: JSON.stringify({render_js: false, schema: schema, name: 'my-scraper'})
};

// This is a Promise that resolves to the new scraper's ID.
const scraperIdPromise = fetch('https://api.webtranspose.com/v1/scraper/create', createOptions)
  .then(response => response.json())
  .then(data => data.scraper_id);


// Run AI Web Scraper
const url = 'https://my-blog-url.com';

scraperIdPromise
  // Wait for the scraper ID before sending the scrape request.
  .then(scraper_id => fetch('https://api.webtranspose.com/v1/scraper/scrape', {
    method: 'POST',
    headers,
    body: JSON.stringify({scraper_id, url})
  }))
  .then(response => response.json())
  .then(response => console.log(response))
  .catch(err => console.error(err));

Complete

You too can contribute to this dataset on GitHub here.