Probably wrong thought :-)

CSV is the universal data structure. Excel, Google Sheets & all those tools mentioned allow EVERYBODY, of all skill levels, software developers or not, to do what they want, from simple analysis to full-on visualization, which is super great! [#ProgrammingAndDataScienceForEveryBody](https://twitter.com/hashtag/ProgrammingAndDataScienceForEveryBody)

CSV Tools Summary Table

  • New to me :-) CSV tools (except for Tad, which was documented yesterday) from the 1000 replies to Simon Willison’s tweet: “If someone gives you a CSV file with 100,000 rows in it, what tools do you use to start exploring and understanding that data?”

  • The github Markdown tables are much nicer! Until I get github flavoured markdown working here, check out the tables in github which have thin lines around the table cells as they should!

| Name | Tweet | Type | Notes |
| --- | --- | --- | --- |
| Postgres | [1], [2] | Database | Use the COPY command, or the CSV foreign data wrapper, which lets you run queries over CSV files without even copying them in, which is good for folders of CSVs (see the COPY sketch after the table) |
| awk, cut, wc, grep, xargs, etc | [1] | classic Unix CLI text tools | xargs & many classic tools don’t work with embedded commas, e.g. `1, "1,2,3", 2`; they also don’t work with Unicode and other characters outside classic ASCII |
| q | | | |
| Tableau | [1] | proprietary | for plotting |
| xsv | [1] | CLI tool | xsv and csvkit, to get a sense of the size, cardinality and consistency of the data before bothering to open it up in Excel or sqlite |
| csvkit | | | |
| pandas in Python | [1] | | thread from Gus about a 5.7GB CSV file (read_csv in batches; see the chunked-read sketch after the table) |
| visidata | | | |
| dirtylittlesql | | | |
| graphex | | | |
| Julia and Duck | | | |
| MySQL and phpMyAdmin | [1] | | from Cal Henderson, who of course knows MySQL and PHP from flickr and many other things :-) |
| Excel, Access, etc | | | Of course, not going to give any further details for these proprietary tools; left as an exercise for the reader :-) |
| Spotfire, Tableau and other BI Tools | [1] | | Again, not likely to use these tools :-) |
| dask for Python | [1] | python library | “How are there no mentions of dask? Like pandas but not shitty. Loads the data in chunks, doesn’t have to load it all into memory. Dask is the answer for all of them. [https://docs.dask.org/en/stable/](https://t.co/qB6GthUdco) If I can’t use python then redis on a high mem ec2 instance to do what i need and shutdown” (see the dask sketch after the table) |
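
To make the Postgres COPY route concrete, here’s a minimal Python sketch using psycopg2. The table name, column layout, and connection details are my assumptions for illustration; they aren’t from the tweets.

```python
# A minimal sketch of loading a CSV into Postgres with COPY via psycopg2.
# Assumptions: a local Postgres, psycopg2 installed, and a big.csv whose
# three columns we just treat as text.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS big (a text, b text, c text)")
    with open("big.csv") as f:
        # COPY ... FROM STDIN streams the file straight into the table;
        # HEADER skips the first row.
        cur.copy_expert("COPY big FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()
```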
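
For the pandas row, reading a big CSV in batches with read_csv’s chunksize looks roughly like this; the file name and the column being counted are made up for the example.

```python
# A minimal sketch of reading a large CSV in chunks with pandas so it never
# has to fit in memory all at once. The column name "category" is made up.
import pandas as pd

counts = None
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    part = chunk["category"].value_counts()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head(20))
```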
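
For the dask row, a minimal sketch of the chunked, lazy approach; the file and column names are made up.

```python
# A minimal dask sketch: read_csv builds a lazy, partitioned dataframe and
# nothing is loaded until you ask for a result, which is then computed
# chunk by chunk rather than all in memory.
import dask.dataframe as dd

df = dd.read_csv("big.csv")
print(len(df))                                   # row count (one pass over the file)
print(df["category"].value_counts().compute())   # cardinality / distribution of one column
```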

Programming language and other small code snippets

| Language/Tool | Tweet | Snippet | Notes |
| --- | --- | --- | --- |
| Datasette on Mac / CLI on Windows | # | `sqlite-utils insert /tmp/data.db rows big.csv --csv` then `datasette /tmp/data.db`, or open the CSV in the Datasette Mac Desktop app | |
| Perl | # | Use Parse::CSV rather than splitting on commas. | |
| Perl | # | `perl -ne 'print if rand() < 1e-3' filename.csv` | Not sure what this does :-) (it keeps each line with probability 0.001, i.e. a quick random ~0.1% sample; see the Python equivalent after the table) |
| Postgres | # | `export DATABASE_URL="postgres://postgres:postgres@localhost:5432/postgres"` then `rows pgimport --schema=:text: --input-encoding=utf-8 --dialect=excel myfile.csv[.gz\|.xz]` | |
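
That Perl one-liner is just a quick random line sampler; a rough Python equivalent (file name assumed) would be:

```python
# Rough Python equivalent of `perl -ne 'print if rand() < 1e-3' filename.csv`:
# keep each line with probability 0.001 to get a quick random sample.
# Note: this samples raw lines, so the header row is usually dropped too.
import random
import sys

with open("filename.csv") as f:
    for line in f:
        if random.random() < 1e-3:
            sys.stdout.write(line)
```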
