build LC-75 web scraper
task id: LC75-scraper
2024-07-09 13:44: Initial work on web scraper for LC75 #LC75-scraper #timelog:01:11:40
2024-07-09 14:00: wget doesn't work, curl doesn't work, selenium hangs #LC75-scraper
2024-07-09 14:05: trying wget again #LC75-scraper
2024-07-09 14:13: I just copy-pasted from the view source in browser #LC75-scraper
2024-07-09 14:34: Picking apart LC75.json data #LC75-scraper
It's an array of items, split up by category (22 categories in total). Each entry is an object with a questionNum field. When you add all the questionNum values together, you get 75.
The key I want for the URL is titleSlug, and the LC problem id is questionFrontendId (I can use this to reference the problem tersely online). Both are fields on the objects inside each category's "questions" array, e.g. .[0].questions[0].
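A rough Python sketch of walking that structure, for when jq gets unwieldy. The field names questionNum, questions, titleSlug and questionFrontendId come from the notes above; the "name" field and the tiny SAMPLE blob are made up for illustration (the real file has 22 categories), and the function names are hypothetical.

```python
import json

# Made-up sample in the shape described above, NOT the real LC75.json.
# The "name" key is an assumption; only the other four keys are from the notes.
SAMPLE = """
[
  {"name": "Array / String", "questionNum": 2,
   "questions": [
     {"titleSlug": "merge-strings-alternately", "questionFrontendId": "1768"},
     {"titleSlug": "greatest-common-divisor-of-strings", "questionFrontendId": "1071"}
   ]},
  {"name": "Two Pointers", "questionNum": 1,
   "questions": [
     {"titleSlug": "move-zeroes", "questionFrontendId": "283"}
   ]}
]
"""


def list_problems(categories):
    """Flatten the categories into (frontend id, slug, problem URL) tuples."""
    out = []
    for cat in categories:
        for q in cat["questions"]:
            slug = q["titleSlug"]
            url = f"https://leetcode.com/problems/{slug}/"
            out.append((q["questionFrontendId"], slug, url))
    return out


def total_questions(categories):
    """Sum the per-category questionNum fields (75 on the real file)."""
    return sum(cat["questionNum"] for cat in categories)


categories = json.loads(SAMPLE)
problems = list_problems(categories)
for frontend_id, slug, url in problems:
    print(frontend_id, slug, url)
print("total:", total_questions(categories))
```

Swapping SAMPLE for the contents of the real LC75.json should be enough, assuming the file really is a top-level array of category objects.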
2024-07-09 14:44: Headless print on linux? #LC75-scraper
It's not going to matter.
2024-07-09 14:49: I have hit a security wall #LC75-scraper
2024-07-09 14:57: Can use wayback machine with limited success #LC75-scraper
Given a leetcode article, I can find a cached version on wayback machine, which I can then curl to an HTML file. Most of the HTML can be stripped using w3m -dump.
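If w3m isn't around, a crude stand-in for the stripping step can be sketched with Python's standard html.parser. This is not what w3m -dump does internally (no layout, no tables), just visible text with script and style bodies dropped. The class name, function name, and sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class TextDump(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def dump_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page fragment, only to exercise the parser.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Two Sum</h1><script>var x=1;</script>"
    "<p>Given an array</p></body></html>"
)
text = dump_text(sample)
print(text)
```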
2024-07-09 15:27: Does wayback machine have an API for getting cached pages? #LC75-scraper #timelog:00:15:48
It does! There's a URL you can query and get a JSON response from. It does produce HTML, but, even with w3m it does seem to be kinda messy. Damn.
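A sketch of the lookup in Python, assuming the URL in question is the Wayback Machine availability endpoint (https://archive.org/wayback/available), which answers with JSON containing an archived_snapshots.closest entry. The helper names and SAMPLE_RESPONSE are made up for illustration; nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumption: this is the endpoint meant in the note above.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDDhhmmss string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the closest snapshot URL from the JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Made-up response in the documented shape, not a real capture.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20240101000000/https://leetcode.com/problems/two-sum/",
            "timestamp": "20240101000000",
        }
    }
})

query = availability_url("https://leetcode.com/problems/two-sum/")
print(query)
print(closest_snapshot(SAMPLE_RESPONSE))
```

To go live, the query URL could be fetched with urllib.request.urlopen and the decoded body passed to closest_snapshot; the snapshot URL that comes back is the thing to curl.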
It had the data embedded in there from a JSON blob. I was able to extract it in Vim. Currently jq-ing it.
Here is my big query:
= #+BEGINpretty.json #+END_SRC
=