Goals:

To learn about the major data formats and practice writing scripts that generate (or reformat) data into a specific format.

Software & Technologies:

  • basic data formats: csv/tsv (comma/tab-separated values), json, yml, etc.

Class:

  • explanation of major formats and their importance
  • hands-on: converting data into structured formats

The Essence

Formats differ in how easy they are to edit, how well they suit analytical software, how readable they are for humans, and whether they are open or proprietary.

XML

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

CSV/TSV

to,from,heading,body
Tove,Jani,Reminder,Don't forget me this weekend!

JSON

{
  "to": "Tove",
  "from": "Jani",
  "heading": "Reminder",
  "body": "Don't forget me this weekend!"
}

YML

to: Tove
from: Jani
heading: Reminder
body: Don't forget me this weekend!
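
As a rough illustration of how one record maps onto several of these formats, here is a minimal Python sketch that uses only the standard library. The variable names (note, xml_text, json_text, csv_text) are assumptions made for this example; YAML is left out because the standard library has no YAML writer.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same note as in the examples above, held as a plain dictionary.
note = {
    "to": "Tove",
    "from": "Jani",
    "heading": "Reminder",
    "body": "Don't forget me this weekend!",
}

# XML: one child element per key under a <note> root.
root = ET.Element("note")
for key, value in note.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

# JSON: the dictionary serializes directly.
json_text = json.dumps(note, indent=2)

# CSV: a header row with the keys, then one row with the values.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(note.keys())
writer.writerow(note.values())
csv_text = buffer.getvalue()

print(xml_text)
print(json_text)
print(csv_text)
```

The point of the sketch is that the data stays the same while only the serialization changes.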

Larger Examples

NB data example from here.

Some online converters can help you convert one format into another, for example: http://www.convertcsv.com/.

CSV / TSV

city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Houston,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Philadelphia,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania

TSV is often a better option than CSV, since tab characters are far less likely than commas to appear in values.
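
A small Python sketch of why this matters, assuming lines are split naively on the delimiter. The record below is hypothetical and is not part of the data set above.

```python
# Hypothetical record whose first value contains a comma.
fields = ["Washington, D.C.", "District of Columbia"]

csv_line = ",".join(fields)
tsv_line = "\t".join(fields)

# Naive splitting on the delimiter, as a simple script might do.
csv_parts = csv_line.split(",")
tsv_parts = tsv_line.split("\t")

print(len(csv_parts))  # 3: the comma inside the value created an extra field
print(len(tsv_parts))  # 2: the tab-separated line splits back correctly
```

Python's csv module avoids the problem by quoting such values, but scripts that split lines by hand will not.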

Neither TSV nor CSV is good at preserving newline characters (\n), that is, text split into multiple lines. As a workaround, one can convert \n into some unlikely-to-occur character combination (for example, ;;;), which makes it possible to restore \n later if necessary.
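
The workaround can be sketched in Python as follows. The function names escape_newlines and restore_newlines are made up for this example, and the sketch assumes that ;;; never occurs in the real data.

```python
PLACEHOLDER = ";;;"  # the unlikely-to-occur combination suggested above


def escape_newlines(text):
    """Replace line breaks so the value fits on a single TSV/CSV line."""
    return text.replace("\n", PLACEHOLDER)


def restore_newlines(text):
    """Turn the placeholder back into real line breaks."""
    return text.replace(PLACEHOLDER, "\n")


original = "Don't forget me\nthis weekend!"
flat = escape_newlines(original)

# The flattened value can now sit in one tab-separated row.
row = "\t".join(["Tove", "Jani", "Reminder", flat])

# Later, after reading the row back, the line break is restored.
restored = restore_newlines(row.split("\t")[3])
print(restored == original)
```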

JSON

[
    {
        "city": "New York", 
        "growth_from_2000_to_2013": "4.8%", 
        "latitude": 40.7127837, 
        "longitude": -74.0059413, 
        "population": "8405837", 
        "rank": "1", 
        "state": "New York"
    }, 
    {
        "city": "Los Angeles", 
        "growth_from_2000_to_2013": "4.8%", 
        "latitude": 34.0522342, 
        "longitude": -118.2436849, 
        "population": "3884307", 
        "rank": "2", 
        "state": "California"
    }, 
    {
        "city": "Chicago", 
        "growth_from_2000_to_2013": "-6.1%", 
        "latitude": 41.8781136, 
        "longitude": -87.6297982, 
        "population": "2718782", 
        "rank": "3", 
        "state": "Illinois"
    }, 
    {
        "city": "Houston", 
        "growth_from_2000_to_2013": "11.0%", 
        "latitude": 29.7604267, 
        "longitude": -95.3698028, 
        "population": "2195914", 
        "rank": "4", 
        "state": "Texas"
    }, 
    {
        "city": "Philadelphia", 
        "growth_from_2000_to_2013": "2.6%", 
        "latitude": 39.9525839, 
        "longitude": -75.1652215, 
        "population": "1553165", 
        "rank": "5", 
        "state": "Pennsylvania"
    }
]
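
A conversion like the one from the CSV above to this JSON could be scripted roughly as follows. This is a minimal sketch using Python's standard csv and json modules; the function name csv_to_records is an assumption, and only the first two rows are included to keep it short.

```python
import csv
import io
import json

csv_text = """city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
"""


def csv_to_records(text):
    """Read CSV text into a list of dictionaries, one per row."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = dict(row)
        # CSV has no types, so every value arrives as a string.
        # Coordinates are converted to numbers, as in the JSON above;
        # the other fields are left as strings.
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        records.append(record)
    return records


records = csv_to_records(csv_text)
json_text = json.dumps(records, indent=4, sort_keys=True)
print(json_text)
```

In practice the CSV text would be read from a file rather than from a string; the string keeps the sketch self-contained.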

YML (not a serial format)


city: New York 
growth_from_2000_to_2013: 4.8% 
latitude: 40.7127837 
longitude: -74.0059413
population: 8405837 
rank: 1 
state: New York

Misc: YAML-like custom format

#ITEM###############################
city: New York 
growth_from_2000_to_2013: 4.8% 
latitude: 40.7127837 
longitude: -74.0059413 
population: 8405837 
rank: 1 
state: New York 
#ITEM###############################
city: Los Angeles 
growth_from_2000_to_2013: 4.8% 
latitude: 34.0522342 
longitude: -118.2436849 
population: 3884307 
rank: 2 
state: California
#ITEM###############################
city: Chicago 
growth_from_2000_to_2013: -6.1% 
latitude: 41.8781136 
longitude: -87.6297982 
population: 2718782 
rank: 3 
state: Illinois
#ITEM###############################
city: Houston 
growth_from_2000_to_2013: 11.0% 
latitude: 29.7604267 
longitude: -95.3698028 
population: 2195914 
rank: 4 
state: Texas
#ITEM###############################
city: Philadelphia 
growth_from_2000_to_2013: 2.6% 
latitude: 39.9525839 
longitude: -75.1652215 
population: 1553165 
rank: 5 
state: Pennsylvania
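
A custom format like this is easy to read back with a short script. The sketch below assumes that every record starts with a line beginning with #ITEM and that each remaining line has the form key: value; the function name parse_items is made up for the example.

```python
SEPARATOR_PREFIX = "#ITEM"


def parse_items(text):
    """Parse '#ITEM'-separated blocks of 'key: value' lines into dictionaries."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(SEPARATOR_PREFIX):
            # A separator line starts a new record.
            current = {}
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first separator
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return records


sample = """#ITEM###############################
city: New York
latitude: 40.7127837
state: New York
#ITEM###############################
city: Los Angeles
latitude: 34.0522342
state: California
"""

items = parse_items(sample)
print(len(items))
print(items[1]["state"])
```

All values come back as strings, so numeric fields would still need converting before analysis.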