How to extract thanked people from a blog
Extract thanked people from a blog using Web Transpose Crawl and AI Web Scraper
4 Dec 2023 • 1 min read • < 1 min coding time
Mike Gee
Context
Why would you want to extract the thanked people from a blog?
You can derive interesting relationships between people by extracting the thanked people from blogs.
Here's a visualization of the data for Paul Graham: PG Number.
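As a rough illustration of what "relationships" can mean here, the sketch below turns (author, thanked person) pairs into a small graph and counts how many distinct authors thank each person. The function names and the placeholder records are assumptions made for this example; they are not part of Web Transpose or of the PG Number dataset.

```python
from collections import defaultdict


def build_thanks_graph(records):
    """Map each author to the set of people they thanked.

    `records` is an iterable of (author, thanked_person) pairs.
    """
    graph = defaultdict(set)
    for author, thanked in records:
        graph[author].add(thanked)
    return graph


def thanked_counts(graph):
    """Count how many distinct authors thanked each person."""
    counts = defaultdict(int)
    for thanked_people in graph.values():
        for person in thanked_people:
            counts[person] += 1
    return dict(counts)


# Placeholder records for illustration only (not real extracted data).
records = [
    ("Author A", "Person X"),
    ("Author A", "Person Y"),
    ("Author B", "Person X"),
]

graph = build_thanks_graph(records)
counts = thanked_counts(graph)
print(counts)
```

Using sets means a person thanked several times by the same author is only counted once per author.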
Extracting Thanked People Using Web Transpose
Defining a Schema
First, we define a schema. Besides the names of the people thanked, we can add further fields, such as the reason each person was mentioned.
schema = {
    'people thanked': {
        'type': 'array',
        'items': {
            'person name': 'string',
            'reason mentioned': [
                'thanked for reading drafts of this blog / essay',
                'other kind of praise',
                'other reason',
            ],
        },
    },
}
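To make the schema a bit more concrete, here's a minimal sketch of a helper that tidies up an extraction result. It assumes the result is a dictionary shaped like the schema (a 'people thanked' list of items with 'person name' and 'reason mentioned'); that shape is an assumption, not documented behaviour, and `clean_people` is a hypothetical name.

```python
# The three reason labels come from the schema; anything else is
# treated as 'other reason'.
ALLOWED_REASONS = (
    'thanked for reading drafts of this blog / essay',
    'other kind of praise',
    'other reason',
)


def clean_people(result):
    """Return the thanked people with names trimmed and reasons normalized.

    Assumes `result` looks like {'people thanked': [{...}, ...]}.
    Entries without a name are dropped.
    """
    cleaned = []
    for item in result.get('people thanked', []):
        name = str(item.get('person name') or '').strip()
        if not name:
            continue
        reason = item.get('reason mentioned')
        if reason not in ALLOWED_REASONS:
            reason = 'other reason'
        cleaned.append({'person name': name, 'reason mentioned': reason})
    return cleaned


# Placeholder result for illustration only (not real scraper output).
sample = {
    'people thanked': [
        {'person name': ' Person X ', 'reason mentioned': 'other kind of praise'},
        {'person name': '', 'reason mentioned': 'other reason'},
        {'person name': 'Person Y', 'reason mentioned': 'unexpected label'},
    ]
}

print(clean_people(sample))
```

Falling back to 'other reason' keeps every named person in the data even when the label doesn't match one of the expected options.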
Extraction using Web Transpose AI Web Scraper
Normally, the AI Web Scraper generates a web scraper that can be reused on the same website, which keeps latency minimal. However, when pages don't follow a standard format, it just extracts the requested data directly.
Python
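import os

import webtranspose as webt

os.environ['WEB_TRANSPOSE_API_KEY'] = "YOUR WEBT API KEY"

# URL of the blog post to extract from
url = 'https://my-blog-url.com'

# `schema` is the dictionary defined above
scraper = webt.Scraper(schema, render_js=False)
print(scraper.scrape(url))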
Tailwind / JS
A Tailwind & JS SDK is currently in development. You can use the API in the meantime.
// `schema` is the same structure as above, written as a JS object
const headers = {'X-API-Key': 'YOUR_API_KEY', 'Content-Type': 'application/json'};
const url = 'https://my-blog-url.com';

// Build AI Web Scraper
const createOptions = {
  method: 'POST',
  headers: headers,
  body: JSON.stringify({render_js: false, schema: schema, name: 'my-scraper'})
};

fetch('https://api.webtranspose.com/v1/scraper/create', createOptions)
  .then(response => response.json())
  .then(data => {
    // Run AI Web Scraper once the scraper_id is available
    const scrapeOptions = {
      method: 'POST',
      headers: headers,
      body: JSON.stringify({scraper_id: data.scraper_id, url: url})
    };
    return fetch('https://api.webtranspose.com/v1/scraper/scrape', scrapeOptions);
  })
  .then(response => response.json())
  .then(response => console.log(response))
  .catch(err => console.error(err));
Complete
You too can contribute to this dataset on GitHub here.