File format

There are both soft and hard limits on the size of individual files on GitHub. (IIRC: the soft limit is 50 MB, the hard limit is 100 MB.) For a few languages, we circumvent this by splitting the file into smaller pieces and suffixing `.1`, `.2`, etc. [Finnish is an example](https://github.com/unimorph/fin). Here, however, there is an old Polish file (presumably from the Wiktionary scrape) called `pol` and a new, much larger compressed file called `pol.zip`. It seems to me the following is true:

* if the new data is better, we should get rid of the old data; in the unlikely event someone wants the old data, it's there in the commit history.
* we should standardize the file names. I'm not sure whether the Finnish style or the Polish style of handling big files is more common, but standardization comes with its own benefits; as things stand, someone trying to use UniMorph across languages has to handle both styles (and possibly others I'm not aware of).
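
To make the cost of the two styles concrete, here is a rough sketch of what a cross-language consumer currently has to write. The function name `iter_unimorph_lines` is hypothetical, and the layout assumptions (split parts named `<lang>.1`, `<lang>.2`, ..., each ending on a line boundary; the archive named `<lang>.zip` and preferred over a plain `<lang>` file) are my reading of the two repositories, not a documented convention.

```python
import io
import zipfile
from pathlib import Path
from typing import Iterator


def _lines(handle) -> Iterator[str]:
    """Yield lines from an open text handle without the trailing newline."""
    for line in handle:
        yield line.rstrip("\n")


def iter_unimorph_lines(repo_dir: str, lang: str) -> Iterator[str]:
    """Yield data lines for `lang`, whichever layout the repo uses (assumed layouts)."""
    root = Path(repo_dir)
    archive = root / f"{lang}.zip"
    # Finnish style: numbered parts, sorted numerically so .10 follows .9.
    parts = sorted(
        (p for p in root.glob(f"{lang}.*") if p.suffix[1:].isdigit()),
        key=lambda p: int(p.suffix[1:]),
    )
    if archive.exists():
        # Polish style: one zip archive; read every non-directory member.
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                if name.endswith("/"):
                    continue
                with zf.open(name) as raw:
                    yield from _lines(io.TextIOWrapper(raw, encoding="utf-8"))
    elif parts:
        for part in parts:
            with part.open(encoding="utf-8") as fh:
                yield from _lines(fh)
    else:
        # Default style: a single uncompressed file named after the language.
        with (root / lang).open(encoding="utf-8") as fh:
            yield from _lines(fh)
```

Three branches for what should be one file read is the overhead that a single naming convention would remove.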

On reflection, I think I'd prefer a single compressed file over files split up by unclear means, so the Polish style is probably better. (That said, `.zip` is a legacy format for compression, designed for use with multiple files and directories rather than single files. For single-file compression, `.gz`, `.bz2`, and `.xz` are all more appropriate and have better licenses.) But this is a pan-language issue and not specific to Polish, despite me bringing it up here...
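
For illustration, a minimal sketch of the single-compressed-file approach using `.gz`, with only the Python standard library. The helper names `gzip_file` and `read_gzip_lines` are hypothetical, and I'm assuming the data is UTF-8 text.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterator


def gzip_file(src: str) -> Path:
    """Compress `src` to `src + '.gz'` and return the new path."""
    src_path = Path(src)
    dst_path = src_path.with_name(src_path.name + ".gz")
    # Stream the bytes across so large files never sit fully in memory.
    with src_path.open("rb") as fin, gzip.open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path


def read_gzip_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

A consumer then needs only `read_gzip_lines("pol.gz")`, with no archive member names to guess and no parts to reassemble.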

(Sorry to keep "beating up" on the Polish data. Clearly I decided it was useful enough to look at in detail, so thanks to @wkieras and other authors for adding it to UniMorph.) 

File format #3

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

File format #3

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions