How to find old threads in outbox.json from your Mastodon archive using jq (the lo-tech solution is better again, i.e. it is always faster to edit and search using emacs or Visual Code)
- UPDATE 3: even better, output a properly formed JSON file, i.e. an array :-), by adding [] around the query. The resulting JSON can then be fed to Datasette Lite to allow searching and manipulation:
jq '[.orderedItems | .[] | {id: .object.id? | select (. !=null), content: .object.content? | select(. != null)}]' outbox.json > array_filtered_id_content.json
- super long url that loads array_filtered_id_content.json into lite.datasette.io
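A minimal sketch to check that the [] wrapper really yields a single JSON array. The sample file, its example.social URLs and the temporary directory are all made up for illustration; a real outbox.json has many more fields per item.

```shell
# Hypothetical two-item stand-in for a real outbox.json:
# one post with a full object, one boost whose "object" is only a URL string.
tmpdir=$(mktemp -d)
cat > "$tmpdir/outbox.json" <<'EOF'
{"orderedItems": [
  {"type": "Create", "object": {"id": "https://example.social/users/me/statuses/1", "content": "<p>wget notes</p>"}},
  {"type": "Announce", "object": "https://example.social/users/other/statuses/2"}
]}
EOF

# Same filter as above, wrapped in [ ] so the output is one array.
jq '[.orderedItems | .[] | {id: .object.id? | select(. != null), content: .object.content? | select(. != null)}]' \
  "$tmpdir/outbox.json" > "$tmpdir/array_filtered_id_content.json"

# The top-level type should be "array"; the boost is dropped, so length is 1.
jq -r 'type' "$tmpdir/array_filtered_id_content.json"
jq 'length' "$tmpdir/array_filtered_id_content.json"
```

The boost disappears because .object.id? produces nothing when the object is a plain string, so no {id, content} object is built for it.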
- UPDATE 2: how to output a JSON file! Output is in filtered_id_content.json:
cd 2024-05-04-devdilettante-com-archive-20240505032936-7d3110aafe1a03f80e3db147600edd38/
jq '.orderedItems | .[] | {id: .object.id? | select(. != null), content: .object.content? | select(. != null)}' outbox.json > filtered_id_content.json
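One possible variation, not something from the original toots: jq's -c flag prints each object on a single line, so a later grep returns the id together with the content. The sample data, the example.social URLs and the .jsonl file name below are assumptions for illustration.

```shell
# Hypothetical three-item stand-in for a real outbox.json.
tmpdir=$(mktemp -d)
cat > "$tmpdir/outbox.json" <<'EOF'
{"orderedItems": [
  {"object": {"id": "https://example.social/users/me/statuses/1", "content": "<p>wget --spider notes</p>"}},
  {"object": {"id": "https://example.social/users/me/statuses/2", "content": "<p>httrack notes</p>"}},
  {"object": "https://example.social/users/other/statuses/3"}
]}
EOF

# -c (compact output) writes one {id, content} object per line.
jq -c '.orderedItems | .[] | {id: .object.id? | select(. != null), content: .object.content? | select(. != null)}' \
  "$tmpdir/outbox.json" > "$tmpdir/filtered_id_content.jsonl"

# Each matching line now carries both the link and the text of the toot.
grep wget "$tmpdir/filtered_id_content.jsonl"
```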
- UPDATE 1: how to display the id field so you can see the link to the toot:
jq '.orderedItems | .[] | .object.id + " " + .object.content? | select(. != null)' /Users/roland/Documents/DEV_DILETTANTE_COM_MASTODON_BACKUPS/2024-05-04-devdilettante-com-archive-20240505032936-7d3110aafe1a03f80e3db147600edd38/outbox.json | grep wget
- which results in:
"https://devdilettante.com/users/roland/statuses/112341435858857502 <p>`wget --execute="robots = off" --spider --force-html -r -l 0 $url 2>&1 | grep -e '^--' | grep -e '\\.\\(jpeg\\|mp3\\|png\\|gif\\|jpg\\) 2>stderr_mp3_jpg_png_urls.txt>_level_0_mp3_jpg_png_urls.txt` appears to get a list of URLs from a WordPress (or any?) website (via <a href=\"https://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" translate=\"no\"><span class=\"invisible\">https://</span><span class=\"ellipsis\">stackoverflow.com/questions/28</span><span class=\"invisible\">04467/spider-a-website-and-return-urls-only</span></a> )</p>"
"https://devdilettante.com/users/roland/statuses/112343480523962150 <p>`cat _level_0_mp3_jpg_png_urls.txt | sed "s/-- / /g" | awk '{ print $3 }' | grep _photos | > photo_urls.txt ; mkdir PHOTOS; cd !$ ; cat ../photo_urls.txt | xargs -n 1 wget `</p>"
"https://devdilettante.com/users/roland/statuses/112343502318371811 <p>replace `'` with '`'` to keep bash and wget happy :-) (first `'` occurs at line 410</p>"
"https://devdilettante.com/users/roland/statuses/112343520971736534 <p>`tail --lines=+411 ../photo_urls.txt| xargs -n 1 wget`</p>"
"https://devdilettante.com/users/roland/statuses/112346164952219195 <p>I will work on potential issue with commas and colons after i see how well httrack works.<br />Seems to be much better than wget.<br />Here's the magic :-) incantation:<br />`httrack <a href=\"http://www.barbandroland.com\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" translate=\"no\"><span class=\"invisible\">http://www.</span><span class=\"\">barbandroland.com</span><span class=\"invisible\"></span></a> -W -O "/Users/roland/Documents/BARB_AND_ROLAND_DOT_COM_BACKUPS/HTTRACK_BACKUP/barbandrolandbackup" -%v -s0 +www.barbandroland.com/*.jpg`</p>"
- From Mastodon: I couldn’t find my wget thread from April 26th, so I searched my Mastodon archive as follows to get it :-) :
jq '.orderedItems | .[] | .object.content? | select(. != null)' \
/Users/roland/Documents/DEV_DILETTANTE_COM_MASTODON_BACKUPS/2024-05-04-devdilettante-com-archive-20240505032936-7d3110aafe1a03f80e3db147600edd38/outbox.json \
| grep wget
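The grep step can also be folded into jq itself. This is only a sketch of one way to do it, not what I ran at the time: it keeps items whose object is a real JSON object, lower-cases the content, and checks for the substring. The sample file and its example.social URLs are invented for illustration.

```shell
# Hypothetical three-item stand-in for a real outbox.json.
tmpdir=$(mktemp -d)
cat > "$tmpdir/outbox.json" <<'EOF'
{"orderedItems": [
  {"object": {"id": "https://example.social/users/me/statuses/1", "content": "<p>Wget spider notes</p>"}},
  {"object": {"id": "https://example.social/users/me/statuses/2", "content": "<p>httrack notes</p>"}},
  {"object": "https://example.social/users/other/statuses/3"}
]}
EOF

# objects        : skip boosts, whose "object" is just a URL string
# ascii_downcase : make the match case-insensitive
# contains       : plain substring check, no regex needed
# -r             : print raw text rather than a JSON-quoted string
jq -r '.orderedItems | .[] | .object | objects
       | select(.content != null)
       | select(.content | ascii_downcase | contains("wget"))
       | .id + " " + .content' "$tmpdir/outbox.json" > "$tmpdir/wget_toots.txt"

cat "$tmpdir/wget_toots.txt"
```

Filtering inside jq means the id and the content stay together for each match, and the boosts are skipped before anything tries to index into them.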
- So much yak shaving. It was actually faster to figure this out by opening outbox.json in Visual Code or emacs :-) and searching for wget, but jq is a fun challenge for some value of fun :-) ?!?!?
- The aforementioned thread from April 26, 2024 is here: https://devdilettante.com/@roland/112341435858857502
Previously
- June 20, 2017: How to minify JSON using jq