HTML parsing with pup

5 votes

2 answers

6831 views

I'm trying to parse an HTML page with [pup](https://github.com/ericchiang/pup). It is a command-line HTML parser that accepts CSS selectors. I know I could use Python, which I do have installed on my machine, but I'd like to learn pup to get some practice with the command line.

The page I want to scrape is https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1. I saved it to an HTML file:

curl https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1  > fbi2018.html

How do I extract a single column of data, such as 'Population'? This is the command I originally wrote:

cat fbi2018.html | grep -A1 'cell31 ' | grep -v 'cell31 ' | sed 's/text-align: right;//' | sed 's///' | sed 's/--//' | sed '/^[[:space:]]*$/d' | sort -nk1,1

It actually works, but it's an ugly, hacky way to do it, which is why I want to use pup. I noticed that every value I need from the 'Population' column has `headers="cell31 .."` somewhere within its `<td>` tag. For example:

323,405,935

I want to extract every value whose `<td>` tag carries this particular header, which in this example would be `323,405,935`. It seems that combining multiple selectors in pup doesn't work, however. So far, I can select all the td elements:

cat fbi2018.html | pup 'td'

But I don't know how to select only the cells whose headers attribute contains a particular value.

**EDIT:** The output should be:

272,690,813
281,421,906
285,317,559
287,973,924
290,788,976
293,656,842
296,507,061
299,398,484
301,621,157
304,059,724
307,006,550
309,330,219
311,587,816
313,873,685
316,497,531
318,907,401
320,896,618
323,405,935
325,147,121
327,167,434
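
For reference, here is a minimal sketch of the kind of pup command being asked for. It assumes pup's attribute selectors behave like their CSS counterparts, so that `[headers~="cell31"]` matches cells whose space-separated `headers` list contains `cell31`, and that `text{}` prints only the text content. The file `sample.html` and its markup are hypothetical stand-ins for the real table; only the `cell31` token and the two population values come from the question.

```shell
# Hypothetical sample imitating the table markup; the real attribute values may differ.
cat > sample.html <<'EOF'
<table>
  <tr>
    <td headers="cell31 cell1" style="text-align: right;">323,405,935</td>
    <td headers="cell32 cell1" style="text-align: right;">other column</td>
  </tr>
  <tr>
    <td headers="cell31 cell2" style="text-align: right;">325,147,121</td>
    <td headers="cell32 cell2" style="text-align: right;">other column</td>
  </tr>
</table>
EOF

# Select only the td cells whose headers list contains the word cell31,
# print their text, and drop any blank lines pup may emit.
if command -v pup >/dev/null 2>&1; then
  pup 'td[headers~="cell31"] text{}' < sample.html | sed '/^[[:space:]]*$/d'
else
  echo "pup is not installed; skipping the selector demo" >&2
fi
```

If this behaves as assumed, the same selector pointed at `fbi2018.html` instead of `sample.html` should list the Population cells, which could then be piped through `sort -n` as in the original command.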

Asked by rplee (377 rep)

May 29, 2020, 05:39 PM
Last activity: Oct 20, 2023, 11:44 AM
