-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnaming-data-files-revisited.html
125 lines (119 loc) · 8.54 KB
/
naming-data-files-revisited.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title> Naming data files revisited
</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://jrsmith3.github.io/theme/css/normalize.css">
<link href='//fonts.googleapis.com/css?family=Lato' rel='stylesheet' type='text/css'>
<link href='//fonts.googleapis.com/css?family=Oswald' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="https://jrsmith3.github.io/theme/css/font-awesome.min.css">
<link rel="stylesheet" href="https://jrsmith3.github.io/theme/css/main.css">
<link rel="stylesheet" href="https://jrsmith3.github.io/theme/css/blog.css">
<link rel="stylesheet" href="https://jrsmith3.github.io/theme/css/github.css">
<link href="https://jrsmith3.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Generic Surname Atom Feed" />
<script src="https://jrsmith3.github.io/theme/js/vendor/modernizr-2.6.2.min.js"></script>
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> or <a href="http://www.google.com/chromeframe/?redirect=true">activate Google Chrome Frame</a> to improve your experience.</p>
<![endif]-->
<div id="wrapper">
<header id="sidebar" class="side-shadow">
<hgroup id="site-header">
<a id="site-title" href="https://jrsmith3.github.io"><h1><i class="icon-coffee"></i> Generic Surname</h1></a>
<p id="site-desc"></p>
</hgroup>
<nav>
<ul id="nav-links">
<li><a href="https://jrsmith3.github.io/pages/about.html">About</a></li>
<li><a href="https://jrsmith3.github.io/pages/contact.html">Contact</a></li>
<li><a href="https://jrsmith3.github.io/pages/curriculum-vitae.html">Curriculum Vitae</a></li>
</ul>
</nav>
<footer id="site-info">
<p>
Proudly powered by <a href="http://getpelican.com/" target="pelican">Pelican</a> and <a href="http://python.org/" target="python">Python</a>. Theme by <a href="https://github.com/hdra/pelican-cait" target="github">hndr</a>.
</p>
<p>
Textures by <a href="http://subtlepatterns.com/" target="subtlepatterns">Subtle Pattern</a>. Font Awesome by <a href="http://fortawesome.github.io/Font-Awesome/" target="github">Dave Grandy</a>.
</p>
</footer></header>
<div id="post-container">
<ol id="post-list">
<li>
<article class="post-entry">
<header class="entry-header">
<time class="post-time" datetime="2015-03-22T00:00:00-04:00" pubdate>
Sun 22 March 2015
</time>
<a href="https://jrsmith3.github.io/naming-data-files-revisited.html" rel="bookmark"><h1>Naming data files revisited</h1></a>
</header>
<section class="post-content">
<p>I wrote a <a href="http://jrsmith3.github.io/naming-data-files.html">post about naming data files</a> awhile back. Also, yesterday I found that <a href="http://conda.pydata.org/docs/intro.html#package-naming-conventions">conda canonical filenames</a> <a href="https://github.com/conda/conda/issues/974#issuecomment-61172554">cannot support semver version strings</a> because the dash ("-") character is disallowed in the string.</p>
<p>This issue is perennial: everybody reinvents the wheel of coming up with a file naming scheme for their system. Look, the goal of a file naming convention is to balance a few competing requirements:</p>
<ul>
<li>Human legibility. People should have a fighting chance of understanding the file contents by reading the filename.</li>
<li>Uniqueness. The convention should mitigate name collisions.</li>
<li>Orthogonality and precision. Components of the naming convention should be conceptually disjoint.</li>
<li>Consistency.</li>
</ul>
<h1>Suggestion: data serialization format as filename</h1>
<p>I suggest that a general solution to the problem of coming up with file naming schemes is to employ some kind of human-readable data serialization standard like JSON, YAML, etc. for filenames. The idea is to use a string encoding a dictionary and populate the dictionary with metadata about the file. The only constraint on the type of data serilization format is constraints on filenames imposed by the filesystem.</p>
<p>JSON would be the obvious choice here, but unfortunately defining a dict requires a colon character (":") between the key and value. This character is illegal in filenames on Apple and MS Windows systems. YAML is out for the same reason plus the use of whitespace and newlines.</p>
<p>Thus, I propose using the old <a href="http://www.gnustep.org/resources/documentation/Developer/Base/Reference/NSPropertyList.html">NextSTEP plist format</a>. The fields and values should be of type string. In order to avoid invalid filesystem characters, the strings should be converted to <a href="https://en.wikipedia.org/wiki/Percent-encoding">percent encoding</a> before writing the filename.</p>
<p>Using this plist dictionary as a filename, one could imagine someone writing a plugin for <code>grep</code> that parses the filename and searches the dict.</p>
<p>Unforunately, the NextSTEP plist format is pretty old and in five minutes of searching Google, I wasn't able to find many packages to parse it. On the other hand, the NextSTEP plist format for dicts isn't terribly different from the JSON representation; replacing the "=" with ":" in the filename string should suffice to make the string parseable JSON.</p>
<h1>Examples</h1>
<p>For files containing experimental data, the filename could include the following information:</p>
<ul>
<li>date and time stamp</li>
<li>sample</li>
<li>researcher's name</li>
<li>technique</li>
</ul>
<p>Additionally, the filename could include a checksum (e.g. SHA256) to verify the integrity of the data.</p>
<p>For example, the filename <code>20110701-1453_stm_jrs0069_jrs.sm4</code> could become:</p>
<div class="highlight"><pre>{"d"="20110701T1453","e"="stm","s"="jrs0069","p"="Joshua Ryan Smith","sha256sum"="140acb7ef861ba01d74f3ee10a0a28eb80a6212157cc6ee1755bf4dfb772c6b7"}.sm4
</pre></div>
<p>where "d" stands for "date," "e" stands for "experiment type," "s" stands for "sample," "p" stands for "person," and "sha256sum" is self-explanatory.</p>
<p>The plist style filename contains more characters, but retains many of the advantages of my <a href="http://jrsmith3.github.io/naming-data-files.html">old suggestion</a>.</p>
</section>
<hr/>
<aside class="post-meta">
<p>Category: <a href="https://jrsmith3.github.io/category/blog.html">Blog</a></p>
</aside>
<hr/>
<div class="comments">
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'genericsurname';
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</article>
</li>
</ol>
</div>
</div>
<script>
var _gaq=[['_setAccount','UA-1567200-4'],['_trackPageview']];
(function(d,t){var g=d.createElement(t),s=d.getElementsByTagName(t)[0];
g.src=('https:'==location.protocol?'//ssl':'//www')+'.google-analytics.com/ga.js';
s.parentNode.insertBefore(g,s)}(document,'script'));
</script>
<script src="https://jrsmith3.github.io/theme/js/main.js"></script>
</body>
</html>