WIP: q, xsv, graphext, dirtylittlesql: tools for handling large CSV files from Simon Willison's much-replied-to thread; CSV, the universal data structure
Probably wrong thought :-)
CSV Tools Summary Table
- New to me :-) CSV Tools (except for Tad which was documented yesterday) from the 1000 replies to Simon Willison’s tweet: If someone gives you a CSV file with 100,000 rows in it, what tools do you use to start exploring and understanding that data?
- The github Markdown tables are much nicer! Until I get github flavoured markdown working here, check out the tables in github which have thin lines around the table cells as they should!

Name | Tweet | Type | Notes |
---|---|---|---|
Postgres | [1], [2] | Database | * Use the COPY command, or the CSV data wrapper, which lets you do CSV operations without even copying the files, which is good for folders |
awk, cut, wc, grep, xargs, etc | [1] | classic Unix CLI text tools | * xargs & many classic tools don’t work with embedded commas e.g. 1, "1,2,3", 2 * They also don’t work with unicode and other non classic ASCII characters |
q | | | |
Tableau | [1] | proprietary | * for plotting |
xsv | [1] | cli tool | * xsv and csvkit, to get a sense of the size and cardinality and consistency of the data before I bother opening it up in Excel or sqlite |
csvkit | | | |
pandas in Python | [1] | | * thread from Gus about a 5.7GB CSV file (read_csv in batches) |
visidata | | | |
dirtylittlesql | | | |
graphext | | | |
Julia and Duck | | | |
MySQL and phpMyAdmin | [1] | | * from Carl Henderson who of course knows MySQL and PHP from flickr and many other things :-) |
Excel, Access, etc | | | * Of course, not going to give any further details for these proprietary tools, left as an exercise to the reader :-) |
Spotfire, Tableau and other BI Tools | [1] | | * Again not likely to use these tools :-) |
dask for Python | [1] | python library | * "How are there no mentions of dask? Like pandas but not shitty. Loads the data in chunks, doesn’t have to load it all into memory. Dask is the answer for all of them. [https://docs.dask.org/en/stable/](https://t.co/qB6GthUdco) If I can’t use python then redis on a high mem ec2 instance to do what i need and shutdown" |

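
To make the embedded commas problem concrete, here is a minimal Python sketch (standard library only, not taken from any of the tweets). It compares naive comma splitting, which is roughly what `cut -d,` or `awk -F,` do, with a CSV-aware parser on the example row from the table. It then streams a small made-up file once to get row count and per-column cardinality, in the spirit of what xsv and csvkit report. The `profile` helper and the `sample` data are my own assumptions for illustration.

```python
import csv
import io

# The example row from the table: the middle field contains commas.
line = '1, "1,2,3", 2'

# Naive splitting cuts the quoted field into pieces.
naive_fields = line.split(",")

# A CSV-aware parser keeps the quoted field together.
# skipinitialspace=True is needed because of the spaces after the commas.
parsed_fields = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)


def profile(fileobj):
    """Hypothetical helper: stream a CSV once and return
    (row_count, {column_name: number_of_distinct_values})."""
    reader = csv.reader(fileobj, skipinitialspace=True)
    header = next(reader)
    distinct = [set() for _ in header]
    rows = 0
    for row in reader:
        rows += 1
        for values, cell in zip(distinct, row):
            values.add(cell)
    return rows, {name: len(values) for name, values in zip(header, distinct)}


# Made-up sample data; a real file would be passed as open(path, newline="").
sample = 'id,tags,n\n1,"1,2,3",2\n2,"1,2,3",5\n'
row_count, cardinality = profile(io.StringIO(sample))
print(row_count, cardinality)
```

Only one row is held in memory at a time, but the sets of distinct values can still grow large on high-cardinality columns, so for really big files a dedicated tool from the table above is probably the better choice.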
Programming language and other small code snippets
Language/Tool | Tweet | Snippet | Notes |
---|---|---|---|
Datasette on Mac / CLI on Windows | # | `sqlite-utils insert /tmp/data.db rows big.csv --csv` then `datasette /tmp/data.db`, or open the CSV in the Datasette Mac Desktop app | |
Perl | # | * Use Parse::CSV rather than splitting on commas. | |
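Perl | # | Not sure what this does :-) : `perl -ne 'print if rand() < 1e-3' filename.csv` | Looks like random sampling: it prints each line with probability 1e-3, so about 0.1% of the rows |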
Postgres | # | export DATABASE_URL="postgres://postgres:postgres@localhost:5432/postgres" rows pgimport --schema=:text: --input-encoding=utf-8 --dialect=excel myfile.csv[.gz |
.xz |