build LC-75 web scraper

task id: LC75-scraper

2024-07-09 13:44: Initial work on web scraper for LC75 #LC75-scraper #timelog:01:11:40

2024-07-09 14:00: wget doesn't work, curl doesn't work, selenium hangs #LC75-scraper

2024-07-09 14:05: trying wget again #LC75-scraper

2024-07-09 14:13: I just copy-pasted the page source from view-source in the browser #LC75-scraper

The page source had the data embedded in it as a JSON blob. I was able to extract it in Vim. Currently jq-ing it.

Here is my big query:

jq .props.pageProps.dehydratedState.queries[0].state.data.studyPlanV2Detail.planSubGroups data_pretty.json
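
For reference, a rough Python sketch of the same extraction without the Vim step. I'm assuming the blob sits in a Next.js-style __NEXT_DATA__ script tag (a guess based on the .props.pageProps path) and that the saved source is in page_source.html:

import json
import re

# Pull the embedded JSON blob out of the saved page source.
# Assumption: it lives in a __NEXT_DATA__ script tag; adjust the id if the real page differs.
with open("page_source.html", encoding="utf-8") as f:
    html = f.read()

m = re.search(r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
              html, re.DOTALL)
if not m:
    raise SystemExit("no embedded JSON blob found")

data = json.loads(m.group(1))

# Same path as the jq query above.
groups = (data["props"]["pageProps"]["dehydratedState"]["queries"][0]
              ["state"]["data"]["studyPlanV2Detail"]["planSubGroups"])
print(json.dumps(groups, indent=2))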

2024-07-09 14:34: Picking apart LC75.json data #LC75-scraper

It's an array of 22 category objects. Each one has a questionNum field, and when you add all the questionNums together, you get 75.

The key I want for the URL is titleSlug, and the problem id is questionFrontendId (I can use this to reference the problem tersely online). Both are fields on the objects in each category's "questions" array, e.g. .[0].questions[0].
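
A quick sketch that flattens it all, assuming LC75.json holds the planSubGroups array from the query above and that the URL follows the usual leetcode.com/problems/<titleSlug>/ shape:

import json

with open("LC75.json", encoding="utf-8") as f:
    groups = json.load(f)

# Sanity check: 22 categories whose questionNum fields sum to 75.
total = sum(g["questionNum"] for g in groups)
print(f"{len(groups)} categories, {total} questions")

# One line per problem: terse id plus the URL built from titleSlug.
for g in groups:
    for q in g["questions"]:
        print(f'{q["questionFrontendId"]}\thttps://leetcode.com/problems/{q["titleSlug"]}/')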

2024-07-09 14:44: Headless print on linux? #LC75-scraper

It's not going to matter.

2024-07-09 14:49: I have hit a security wall #LC75-scraper

2024-07-09 14:57: Can use the Wayback Machine with limited success #LC75-scraper

Given a LeetCode article, I can find a cached version on the Wayback Machine, which I can then curl to an HTML file. Most of the HTML can be stripped with w3m -dump.
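
Roughly the same pipeline in Python, assuming w3m is on the PATH; the snapshot URL below is just a placeholder:

import subprocess
import urllib.request

# Placeholder snapshot URL -- substitute one found on the Wayback Machine.
snapshot = "https://web.archive.org/web/20240101000000/https://leetcode.com/problems/merge-strings-alternately/"

# curl it to an HTML file...
urllib.request.urlretrieve(snapshot, "cached.html")

# ...then strip most of the HTML, same as running w3m -dump by hand.
text = subprocess.run(["w3m", "-dump", "cached.html"],
                      capture_output=True, text=True, check=True).stdout
print(text)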

2024-07-09 15:27: Does the Wayback Machine have an API for getting cached pages? #LC75-scraper #timelog:00:15:48

It does! There's a URL you can query and get a JSON response from. The cached page it points at is still HTML, though, and even with w3m the output is kinda messy. Damn.
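
For next time, a sketch against what I believe is the availability API (archive.org/wayback/available), which returns JSON with the closest cached snapshot; the article URL is just a placeholder:

import json
import urllib.parse
import urllib.request

# Placeholder article URL.
article = "https://leetcode.com/problems/merge-strings-alternately/"

# Wayback Machine availability API: one URL in, closest snapshot (if any) out as JSON.
api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(article, safe="")
with urllib.request.urlopen(api) as resp:
    info = json.load(resp)

closest = info.get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print(closest["timestamp"], closest["url"])  # cached page, ready for curl / w3m -dump
else:
    print("no snapshot archived")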