Description
There are both soft and hard limits on the size of individual files on GitHub. (IIRC: the soft limit is 50 MB, the hard limit is 100 MB.) For a few languages, we circumvent this by splitting the file into smaller pieces and suffixing .1, .2, etc. Finnish is an example. Here, however, there is an old Polish file (presumably from the Wiktionary scrape) called pol and then a new, much larger compressed file called pol.zip. It seems to me that both of the following are true:
- if the new data is better, we should get rid of the old data; in the unlikely event someone wants the old data, it's there in the commit history.
- we should standardize the file names. I'm not sure whether the Finnish style or the Polish style of handling big files is more common, but standardization comes with its own benefits; as things stand, someone trying to use UniMorph across languages has to handle both styles (and possibly others I'm not aware of).
On reflection, I think I'd prefer a single compressed file over files split up by unclear means, so the Polish style is probably better. (That said, .zip is a legacy compression format, designed for archives of multiple files and directories rather than single files. For single-file compression, .gz, .bz2, and .xz are all more appropriate and have better licenses.) But this is a pan-language issue and not specific to Polish, despite my bringing it up here.
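To make the cost of the inconsistency concrete, here is a rough sketch of what a cross-language consumer currently has to write. It is only an illustration under assumptions: the function name read_unimorph_lines is made up, the .gz branch is a hypothetical future layout rather than something in the repositories today, and I'm assuming UTF-8 data, one data file per zip archive, and numbered pieces that are plain byte-wise splits of the original file.

```python
import gzip
import zipfile
from pathlib import Path


def read_unimorph_lines(lang_dir, code):
    """Return the data lines for one language, whichever storage style is used.

    Hypothetical helper: the lookup order and the layouts it checks are
    assumptions for illustration, not a documented UniMorph convention.
    """
    lang_dir = Path(lang_dir)
    zipped = lang_dir / f"{code}.zip"
    gzipped = lang_dir / f"{code}.gz"

    if zipped.exists():
        # "Polish style": a single compressed archive.
        with zipfile.ZipFile(zipped) as zf:
            # Assumes the archive holds exactly one member.
            data = zf.read(zf.namelist()[0])
    elif gzipped.exists():
        # Hypothetical single-file gzip layout.
        data = gzip.decompress(gzipped.read_bytes())
    else:
        # "Finnish style": numbered pieces code.1, code.2, ...
        # Sort numerically so that .10 comes after .2.
        parts = sorted(
            (p for p in lang_dir.glob(f"{code}.*")
             if p.stem == code and p.suffix[1:].isdecimal()),
            key=lambda p: int(p.suffix[1:]),
        )
        if not parts:
            # Fall back to a single uncompressed file named after the code.
            parts = [lang_dir / code]
        # Join the raw bytes before decoding, in case a piece ends mid-line.
        data = b"".join(p.read_bytes() for p in parts)

    return data.decode("utf-8").splitlines()
```

With a single agreed-upon layout, everything except one of these branches goes away, which is really the whole argument for standardizing.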
(Sorry to keep "beating up" on the Polish data. Clearly I decided it was useful enough to look at in detail, so thanks to @wkieras and other authors for adding it to UniMorph.)