Licensing Hanzi + Data sets #18

Open
tony opened this issue Dec 14, 2013 · 5 comments
Comments

@tony

tony commented Dec 14, 2013

Greetings @nieldlr!

What is the license with Hanzi? I prefer MIT, BSD and Apache.

Separately, what is the copyright, licensing and source of the data sets @ https://github.com/nieldlr/Hanzi/tree/master/lib/data?

@nieldlr
Owner

nieldlr commented Dec 15, 2013

Hello @tony,

thanks so much for bringing this up. The license was long overdue. I've updated the readme and added an MIT license.

For the data it gets a bit trickier. You can see I listed all of the ones that were not created by me or generated by HanziJS. Most of the data should be okay for commercial use, since their licenses allow it, but it depends on how you use it. For example, I contacted the creators of the Leiden University Word Frequency Corpus and they seemed happy to let me use it in HanziCraft.com as long as I don't sell the actual data.

What are you building? I'm very curious! Would love to add it to the list of projects that use HanziJS.

@tony
Author

tony commented Dec 15, 2013

@nieldlr I'm happy to meet you.

Hanzi looks great.

I think it would be appropriate for the licenses of the data sets to be stated in the documentation. Since Hanzi is a distribution, the license of each dataset should be official.

I kind of want to act as a missionary right now:

ODC / Open Data Commons Attribution License (ODC-By) v1.0 - http://opendatacommons.org/licenses/by/summary/ - http://opendatacommons.org/licenses/by/1.0/ - Simple, guarantees attribution.

"as long as I don't sell the actual data"

I am following the idea of http://okfn.org/opendata/. IANAL, but from what I understand, this would technically allow the data to be sold wholesale. I understand why the owner of a dataset would be concerned.

However, in real life, we could just burn FreeBSD CDs and hang out on street corners all day like Jay and Silent Bob. I think that moving datasets to Open Data Commons licenses seems nice, since ultimately, it's about what the implementer does with it.

In practice, it protects the real interests of the data provider: it gives attribution to the original provider and ensures the public data can be used by the world.

I am in the process of a similar project for Python. I am wrangling together CJK datasets and trying to get them under MIT/ODC licenses. See cburgmer/cjklib#6.

When it's the right time, I will chime in with what I'm working on. Would you be interested in some sort of collaboration to cover the rest of the useful hanzi data sets? I can help fill in / PR datasets you are missing, and maybe you / we can contact the data providers and see if they can release theirs under ODC?

Edit: I noticed the LICENSE in the README linking to http://lwc.daanvanesch.nl/legal.php and http://lingua.mtsu.edu/chinese-computing/copyright.html. I think for these two providers, it'd be preferable to see them under an ODC/CC-type license. BTW, the Chinese character decomposition by Gavin is ODC now.

@nieldlr
Owner

nieldlr commented Dec 15, 2013

Hey @tony,

this is great! Thanks so much for spearheading this.

In the past I've had a similar idea: to open source data collections and learning materials for learners and developers. I really like the Open Data Commons. First time I've seen it!

I'll release my data under ODC. In fact, I'll clean up the directory a bit to clear out data not being used at present/anymore. I'll then state which ones are mine, released under ODC and then list the other datasets as well.

It's great to see that Gavin has released his under ODC. I'll contact the other two and update you.

Just one thing: I had a look at the licenses, and I like this one a bit more: http://opendatacommons.org/licenses/odbl/summary/
It restricts nefarious uses of the data for me. It would allow others to open up their data as well. Even if it's not immediately clear that it can be accessed in that way, it would make me sleep better at night. :)

Thanks again for this. I really like these ideas!

@tony
Author

tony commented Dec 16, 2013

@nieldlr: I am ok with ODBL.

In practice, assurances granted in writing are common sense, but in open source, permissiveness tends to be best. I go into it in ScottDuckworth/python-anyvcs#32 (comment). GPL is a hardcore example of compliance. My point there is that, in practice, it's in one's own best interest to PR / patch upstream. It takes additional energy and time to maintain a fork.

At worst, there is someone with a DRM'd version of a library who's wasting their time keeping it in sync with the main dataset. The common moocher isn't the type who is going to be holding back a useful patch anyway.

If there's an off chance a genius grabs the data set and creates a better derivative, good for them. It may be in their best interest to patch back to the original for 1.) glory and 2.) to avoid having to synchronize the diff. They are still required to give attribution.

I'm fine with ODBL and ODC. I prefer ODC.

@tony
Author

tony commented Jan 19, 2014

@nieldlr: https://github.com/tony/cihai/tree/master/cihai/datasets/unihan is worth keeping an eye on. I have a README there that outlines some standards I found for the cjkdata. I will probably separate it into a different repo soon.

Edit: Now https://github.com/cihai, https://github.com/cihai/cihai-handbook, https://github.com/cihai/cihaidata-python.
