Data source for Lexique interfaces
- Required:
- Ruby v1.8+
- Optional:
Top-level files include dependency management (`Gemfile`), conversion scripts (`csv2sqlite`), and the README.
The `lexique.org` directory contains all of the original releases from Lexique, their documentation, and their CSV conversions. Sub-folders are organized by release version (e.g. `lexique.org/3.80` for Lexique 3.80).
The `sqlite` directory contains the SQLite database files for each version; this directory is not organized by version.
Lexique is distributed as an Excel file (.xlsb). To use it effectively in other environments, it must be converted to a more usable format. CSV (comma-separated values) is a good intermediary, but it is inefficient to query and offers little functionality. We therefore convert Lexique into a SQLite database.
1. Convert the Excel file to CSV
   - Open the file with Excel or LibreOffice Calc, and save it as CSV with tab delimiters (the `csv2sqlite` script expects CSVs to be delimited with `'\t'`)
2. Rename the CSV column headings
   - Remove the column number from the heading (e.g. rename `1_ortho` to `ortho`)
   - Clean up anything else extraneous for sane column names (e.g. rename `freqlemfilms2` to `freqlemfilms`)
   - Ensure that `-` is converted to `_`, because the `csv2sqlite` script will throw away `-` during conversion (the `-` is lost when the script converts headings to symbols via `to_sym`)
3. Use `csv2sqlite.rb` to convert the CSV with the following command: `ruby csv2sqlite.rb -o Lexique380.db ./lexique.org/3.80/Lexique380.csv words`
   - For help with `csv2sqlite`, run: `ruby csv2sqlite.rb -h` or `ruby csv2sqlite.rb --help`
4. Optional: transliterate the words
   - For easier searching, we can transliterate the words in the database, which removes their accents. However, we don't want to modify the accented entry directly; instead, we create a new column to hold the transliterated version of the word. `transliterate.rb` takes data from `column_name`, transliterates it, and writes it into `to_column`. Here's how to run the script:
     `ruby transliterate.rb Lexique380.db words ortho ortho_sa orthrenv orthrenv_sa`
   - The above command transliterates the word in column `ortho` and puts the result in `ortho_sa` (orthographe sans accents). Columns are given in source/target pairs, so more pairs can be appended (e.g. the additional arguments `orthrenv` and `orthrenv_sa` shown above).
   - For help with `transliterate`, run: `ruby transliterate.rb -h` or `ruby transliterate.rb --help`
5. Move the `Lexique_VERSION.db` file from the root directory to the `sqlite` directory
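
The heading clean-up and the optional transliteration described in the steps above are both small text transformations, so a rough Ruby sketch may help make them concrete. The helpers `clean_heading`, `clean_header_line`, and `strip_accents` are hypothetical names chosen for illustration; they are not taken from `csv2sqlite.rb` or `transliterate.rb`, whose actual implementations may differ. The accent stripping assumes Ruby 2.2 or later (for `String#unicode_normalize`), which is newer than the minimum version listed above. The heading `some-name` used to exercise the `-` rule is made up.

```ruby
# Sketch only: these helpers are hypothetical and are NOT part of
# csv2sqlite.rb or transliterate.rb.

# Normalise one CSV heading:
#   - drop a leading column number such as "1_"
#   - turn "-" into "_" so it is not lost during conversion
def clean_heading(heading)
  heading.strip
         .sub(/\A\d+_/, "")
         .tr("-", "_")
end

# Clean every heading in a tab-delimited header line.
def clean_header_line(line)
  line.chomp.split("\t").map { |h| clean_heading(h) }.join("\t")
end

# Strip accents by decomposing characters (NFD) and removing the
# combining marks. Assumes Ruby 2.2+ for String#unicode_normalize.
def strip_accents(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts clean_heading("1_ortho")     # prints "ortho"
puts clean_heading("some-name")   # prints "some_name" (made-up heading)
puts strip_accents("élève")       # prints "eleve"
```

Other extraneous parts of a heading (such as the trailing `2` in `freqlemfilms2`) are a judgment call and are left as a manual edit here. Note also that this decomposition approach only removes combining accents; ligatures such as `œ` have no Unicode decomposition and pass through unchanged.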
The Lexique SQLite file can be used with other languages/frameworks, e.g.:
- Java:
- ActiveJDBC - an object-relational mapping library (preferred)
- JDBC - a direct adapter
- Ruby:
Depending on its use, the database may need to be modified, for example by indexing frequently queried fields (such as `ortho`). Sqliteman is a good tool for this.
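
As an illustration of the indexing mentioned above, the following sketch builds the SQL statement for a single-column index. It assumes the table is named `words`, as in the conversion command shown earlier; the helper `index_sql` and the index naming scheme `idx_<table>_<column>` are hypothetical choices, not something the project prescribes.

```ruby
# Sketch only: index_sql is a hypothetical helper, not part of this repository.
# It returns a SQLite statement that indexes one column of one table.
def index_sql(table, column)
  # Accept plain identifiers only, since the names are interpolated into SQL.
  [table, column].each do |name|
    unless name =~ /\A[A-Za-z_]\w*\z/
      raise ArgumentError, "unsafe identifier: #{name}"
    end
  end
  "CREATE INDEX IF NOT EXISTS idx_#{table}_#{column} ON #{table} (#{column});"
end

puts index_sql("words", "ortho")
# prints: CREATE INDEX IF NOT EXISTS idx_words_ortho ON words (ortho);
```

The resulting statement can be pasted into Sqliteman's SQL editor or run through any other SQLite client.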