Goals:
To learn about the major data formats and practice writing scripts that generate (or reformat) data into a specific format.
Software & Technologies:
- basic data formats: csv/tsv (comma/tab-separated values), json, yml, etc.
Class:
- explanation of major formats and their importance
- hands-on: converting data into structured formats
The Essence
Formats differ in how easy they are to edit, how well they suit analytical software, how readable they are for humans, and whether they are open or proprietary.
XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
CSV/TSV
to,from,heading,body
Tove,Jani,Reminder,Don't forget me this weekend!
JSON
{
"to": "Tove",
"from": "Jani",
"heading": "Reminder",
"body": "Don't forget me this weekend!"
}
YML
to: Tove
from: Jani
heading: Reminder
body: Don't forget me this weekend!
Larger Examples
NB data example from here.
There are some online converters that can help you convert one format into another. For example: http://www.convertcsv.com/.
CSV / TSV
city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Houston,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Philadelphia,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania
TSV is a better option than CSV, since TAB characters are very unlikely to appear in values. Neither TSV nor CSV is good for preserving newline characters (\n), or, in other words, text split into multiple lines. As a workaround, one can convert \n into some unlikely-to-occur character combination (for example, ;;;), which makes it possible to restore \n later, if necessary.
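The newline workaround described above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function names and the sample rows are hypothetical, and the code assumes that the placeholder ;;; never occurs in the real data and that values contain no TAB characters.

```python
PLACEHOLDER = ";;;"  # assumption: this combination never occurs in real values


def encode_newlines(value):
    """Replace line breaks with the placeholder so the value stays on one line."""
    return value.replace("\r\n", "\n").replace("\n", PLACEHOLDER)


def decode_newlines(value):
    """Turn the placeholder back into real line breaks."""
    return value.replace(PLACEHOLDER, "\n")


def rows_to_tsv(rows):
    """Join a list of rows (lists of strings) into TSV text, one row per line."""
    return "\n".join(
        "\t".join(encode_newlines(cell) for cell in row) for row in rows
    )


def tsv_to_rows(text):
    """Split TSV text back into rows and restore the original line breaks."""
    return [
        [decode_newlines(cell) for cell in line.split("\t")]
        for line in text.split("\n")
    ]


# Hypothetical sample: the second value contains a line break.
rows = [["to", "body"], ["Tove", "Don't forget me\nthis weekend!"]]
tsv_text = rows_to_tsv(rows)
restored = tsv_to_rows(tsv_text)
print(tsv_text)
```

The multi-line value is stored on a single line of the TSV text, and reading the text back restores the original rows.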
JSON
[
{
"city": "New York",
"growth_from_2000_to_2013": "4.8%",
"latitude": 40.7127837,
"longitude": -74.0059413,
"population": "8405837",
"rank": "1",
"state": "New York"
},
{
"city": "Los Angeles",
"growth_from_2000_to_2013": "4.8%",
"latitude": 34.0522342,
"longitude": -118.2436849,
"population": "3884307",
"rank": "2",
"state": "California"
},
{
"city": "Chicago",
"growth_from_2000_to_2013": "-6.1%",
"latitude": 41.8781136,
"longitude": -87.6297982,
"population": "2718782",
"rank": "3",
"state": "Illinois"
},
{
"city": "Houston",
"growth_from_2000_to_2013": "11.0%",
"latitude": 29.7604267,
"longitude": -95.3698028,
"population": "2195914",
"rank": "4",
"state": "Texas"
},
{
"city": "Philadelphia",
"growth_from_2000_to_2013": "2.6%",
"latitude": 39.9525839,
"longitude": -75.1652215,
"population": "1553165",
"rank": "5",
"state": "Pennsylvania"
}
]
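As a hedged sketch of the kind of conversion script this class is about, the CSV sample above could be turned into JSON like this with Python's standard csv and json modules. The function name csv_to_records is hypothetical, and the choice to convert only latitude and longitude to numbers (leaving population and rank as strings) is an assumption made to mirror the JSON example above.

```python
import csv
import io
import json

# First two data rows of the CSV example, kept short for illustration.
CSV_TEXT = """city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
"""


def csv_to_records(text):
    """Read CSV text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        record = dict(row)
        # Assumption: only the coordinates become numbers, as in the JSON sample.
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        records.append(record)
    return records


records = csv_to_records(CSV_TEXT)
json_text = json.dumps(records, indent=2)
print(json_text)
```

For a real file, the same function could be given the contents of the file read with open(..., encoding="utf-8"), and the result written out with json.dump.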
YML (not a serial format)
city: New York
growth_from_2000_to_2013: 4.8%
latitude: 40.7127837
longitude: -74.0059413
population: 8405837
rank: 1
state: New York
Misc: YAML-like custom format
#ITEM###############################
city: New York
growth_from_2000_to_2013: 4.8%
latitude: 40.7127837
longitude: -74.0059413
population: 8405837
rank: 1
state: New York
#ITEM###############################
city: Los Angeles
growth_from_2000_to_2013: 4.8%
latitude: 34.0522342
longitude: -118.2436849
population: 3884307
rank: 2
state: California
#ITEM###############################
city: Chicago
growth_from_2000_to_2013: -6.1%
latitude: 41.8781136
longitude: -87.6297982
population: 2718782
rank: 3
state: Illinois
#ITEM###############################
city: Houston
growth_from_2000_to_2013: 11.0%
latitude: 29.7604267
longitude: -95.3698028
population: 2195914
rank: 4
state: Texas
#ITEM###############################
city: Philadelphia
growth_from_2000_to_2013: 2.6%
latitude: 39.9525839
longitude: -75.1652215
population: 1553165
rank: 5
state: Pennsylvania
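A custom format like this one is easy to read back with a short script. The following is a minimal sketch, assuming that every record starts with a line beginning with #ITEM and that every other non-empty line has the form key: value. The function name parse_items is hypothetical, and all values are kept as strings.

```python
SEPARATOR_PREFIX = "#ITEM"  # assumption: every record starts with such a line

# First two records of the example above.
SAMPLE = """#ITEM###############################
city: New York
growth_from_2000_to_2013: 4.8%
latitude: 40.7127837
longitude: -74.0059413
population: 8405837
rank: 1
state: New York
#ITEM###############################
city: Los Angeles
growth_from_2000_to_2013: 4.8%
latitude: 34.0522342
longitude: -118.2436849
population: 3884307
rank: 2
state: California
"""


def parse_items(text):
    """Split the text into records and return them as a list of dictionaries."""
    items = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(SEPARATOR_PREFIX):
            # A separator line opens a new record.
            current = {}
            items.append(current)
            continue
        if current is None:
            continue  # ignore anything that precedes the first separator
        key, _, value = line.partition(": ")
        current[key] = value
    return items


items = parse_items(SAMPLE)
for item in items:
    print(item["city"], item["state"], sep="\t")
```

Once the records are dictionaries, they can be written out in any of the other formats shown in this section.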