Licensing Hanzi + Data sets #18

tony · 2013-12-14T00:34:45Z

Greetings @nieldlr !

What is the license for Hanzi? I prefer MIT, BSD and Apache.

Separately, what is the copyright, licensing and source of the data sets @ https://github.com/nieldlr/Hanzi/tree/master/lib/data?

nieldlr · 2013-12-15T03:46:17Z

Hello @tony,

thanks so much for bringing this up. The license was long overdue. I've updated the readme and added an MIT license.

For the data, it gets a bit trickier. You can see I listed all of the data sets that were not created by me or generated by HanziJS. Most of the data will be okay for commercial use, as their licenses allow it; however, it depends on how you use it. For example, I contacted the creators of the Leiden University Word Frequency Corpus and they seemed happy to let me use it in HanziCraft.com as long as I don't sell the actual data.

What are you building? I'm very curious! I would love to add it to the list of projects that use HanziJS.

tony · 2013-12-15T07:28:00Z

@nieldlr I'm happy to meet you.

Hanzi looks great.

I think it would be appropriate for the licenses of the data sets to be in the documentation. Since Hanzi is a distribution, the license of each data set should be stated officially.

I kind of want to act as a missionary right now:

ODC / Open Data Commons Attribution License (ODC-By) v1.0 - http://opendatacommons.org/licenses/by/summary/ - http://opendatacommons.org/licenses/by/1.0/ - Simple, guarantees attribution.

> as long I don't sell the actual data

I am following the idea of http://okfn.org/opendata/. IANAL, but from what I understand, this would technically allow the data to be sold wholesale. I understand why the owner of a dataset would have concerns.

However, in real life, we could just burn FreeBSD CDs and hang out on street corners all day like Jay and Silent Bob. I think that moving datasets to an Open Data Commons license seems nice, since ultimately, it's about what the implementer does with it.

In practice, it protects the real interests of the data provider: giving attribution to the original provider, and assuring that the public data can be used by the world.

I am in the process of a similar project for Python. I am wrangling together CJK datasets and trying to get them under MIT/ODC licenses. See cburgmer/cjklib#6.

When it's the right time, I will chime in with what I'm working on. Would you be interested in some sort of collaboration to cover the rest of the useful hanzi data sets? I can help fill in / PR datasets you are missing, and maybe you / we can contact the data providers and see if they can release theirs under ODC?

Edit: I noticed the LICENSE in the README linking to http://lwc.daanvanesch.nl/legal.php and http://lingua.mtsu.edu/chinese-computing/copyright.html. I think for these two providers, it'd be preferable to see them under an ODC/CC-type license. BTW, the Chinese character decomposition by Gavin is ODC now.

nieldlr · 2013-12-15T10:14:34Z

Hey @tony,

this is great! Thanks so much for spearheading this.

In the past, I've had a similar idea to open source data collections and learning materials for learners and developers. I really like the Open Data Commons. First time I've seen it!

I'll release my data under ODC. In fact, I'll clean up the directory a bit to clear out data that is no longer being used. I'll then state which data sets are mine and released under ODC, and then list the other datasets as well.

It's great to see that Gavin has released his under ODC. I'll contact the other two and update you.

Just one question: I had a look at the licenses, and I like this one a bit more: http://opendatacommons.org/licenses/odbl/summary/
For me, it restricts nefarious uses of the data. It would also allow others to open up their data. Even if it's not immediately clear that the data can be accessed in that way, it would make me sleep better at night. :)

Thanks again for this. I really like these ideas!

tony · 2013-12-16T01:09:07Z

@nieldlr: I am ok with ODBL.

In practice, assurances granted in writing are common sense, but in open source, permissiveness tends to be the best. I go into it here: ScottDuckworth/python-anyvcs#32 (comment). GPL is a hardcore example of compliance. My point there is that, in practice, it's in one's own best interest to PR / patch upstream. It takes additional energy and time to maintain a fork.

At worst, there is someone with a DRM'd version of a library who's wasting his time keeping up the sync with the main dataset. The common moocher isn't the type who is going to be holding back a useful patch anyway.

If there's an off chance that a genius grabs the data set and creates a better derivative, good for them. It may be in their best interest to patch back to the original for 1.) glory and 2.) avoiding having to synchronize the diff. They are still required to give attribution.

I'm fine with ODBL and ODC. I prefer ODC.

tony · 2014-01-19T02:04:02Z

@nieldlr: https://github.com/tony/cihai/tree/master/cihai/datasets/unihan is worth keeping an eye on. I have a README there that outlines some standards I found for CJK data. I will probably separate it into a different repo soon.

Edit: Now https://github.com/cihai, https://github.com/cihai/cihai-handbook, https://github.com/cihai/cihaidata-python.
