I'm trying to parse an HTML page with [pup](https://github.com/ericchiang/pup), a command-line HTML parser that accepts CSS selectors. I know I could use Python, which I have installed on my machine, but I'd like to learn how to use pup to get some practice with the command line.
The website I want to scrape from is
https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1
I created an html file:
curl https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1 > fbi2018.html
How do I extract out a column of data, such as 'Population'?
This is the command I originally wrote:
cat fbi2018.html | grep -A1 'cell31 ' | grep -v 'cell31 ' | sed 's/text-align: right;//' | sed 's///' | sed 's/--//' | sed '/^[[:space:]]*$/d' | sort -nk1,1
It works, but it's an ugly, hacky way to do it, which is why I want to use pup. I noticed that every value I need from the 'Population' column has `headers="cell31 .."` somewhere within its `<td>` tag. For example:
323,405,935
I want to extract all the values that have this particular header in their `<td>` tag, which in this example would be `323,405,935`.
However, combining multiple selectors in pup doesn't seem to work for me. So far, I can select all the td elements:
cat fbi2018.html | pup 'td'
But I don't know how to select only the td elements whose headers attribute contains a particular value.
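For reference, pup's README lists attribute selectors such as `[attribute~="value"]` (match one whitespace-separated word in the attribute) and a `text{}` display function, so something along these lines might be what I'm after. This is only a sketch. `sample.html` is a made-up stand-in for `fbi2018.html`, the `cell17`, `cell18` and `cell32` header names are invented, and `column_by_header` is a hypothetical helper name. I have not confirmed it against the real page's markup.

```shell
# Made-up stand-in for fbi2018.html; the real page's markup may differ.
cat > sample.html <<'EOF'
<table>
  <tr>
    <td headers="cell17 cell31 ">325,147,121</td>
    <td headers="cell17 cell32 ">other column</td>
  </tr>
  <tr>
    <td headers="cell18 cell31 ">327,167,434</td>
    <td headers="cell18 cell32 ">other column</td>
  </tr>
</table>
EOF

# Hypothetical helper: print the text of every td whose headers attribute
# contains the whitespace-separated word given as $2.
column_by_header() {
  pup "td[headers~=\"$2\"] text{}" < "$1" |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Only run the demo when pup is actually installed.
if command -v pup >/dev/null 2>&1; then
  column_by_header sample.html cell31
fi
```

The `~=` form is used rather than a plain substring match so that `cell31` would not also match a hypothetical `cell310`, which is what the trailing space in my `grep 'cell31 '` was guarding against. The `sed` step just trims surrounding whitespace and drops blank lines from pup's text output.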
**EDIT:**
The output should be:
272,690,813
281,421,906
285,317,559
287,973,924
290,788,976
293,656,842
296,507,061
299,398,484
301,621,157
304,059,724
307,006,550
309,330,219
311,587,816
313,873,685
316,497,531
318,907,401
320,896,618
323,405,935
325,147,121
327,167,434
Asked by rplee
(377 rep)
May 29, 2020, 05:39 PM
Last activity: Oct 20, 2023, 11:44 AM