Language-Classifier

Classifies languages written in Roman script


World Language Classification by pjmathematician

REQUIREMENTS

pip install -r requirements.txt

This installs all the required packages.

HOW TO USE

There are three ways to run the program:

  1. Default: when input.json is present in the same directory as the project, running python main.py creates output.json in that directory.

  2. Custom input file: python main.py /path/to/custom_input.json creates output.json in the same directory as the project.

  3. Custom input and output files: python main.py /path/to/custom_input.json /path/to/custom_output.json creates the output file at the given path.
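
The three modes above boil down to optional positional arguments. Below is a minimal sketch of that argument handling; the default file names come from the usage notes above, and the exact code in main.py may differ:

    import sys

    # Defaults assumed from the usage notes: input.json and output.json in the
    # project directory when no arguments are given.
    input_path = sys.argv[1] if len(sys.argv) > 1 else "input.json"
    output_path = sys.argv[2] if len(sys.argv) > 2 else "output.json"

    # world_language_classification is described in the WORKFLOW section below.
    world_language_classification(input_path, output_path)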

HOW IT WORKS

There are three data files, one for each language: esp_data.txt, hindi_data.txt, and english_data.txt. Each file contains thousands of words collected from different sources.

For English and Spanish, the program checks whether a word appears in the corresponding data file and adds the appropriate suffix. For Hindi, since the input is provided in transliterated Hindi, the program uses the Gestalt pattern matching algorithm from the built-in difflib library to get close matches of a word from the data; these close matches are then compared with the given word using the Jaro similarity algorithm, and the appropriate suffix is added based on the result.

Some filters are applied before the main classification runs, which include the following (a sketch of such filters follows the list):

  1. Removing all special characters that occur with the words
  2. Splitting words joined by '-'
  3. Tokenizing numerical and data values such as time/currency and adding the appropriate suffix
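
A minimal sketch of what these pre-filters could look like. The regular expression, the digit-based numeric check, and the token tags are illustrative assumptions, not taken from the project's source:

    import re

    def prefilter(raw_word):
        # Split words joined by '-' into separate tokens.
        tokens = []
        for part in raw_word.split("-"):
            # Illustrative numeric/time/currency check: anything containing a
            # digit is tagged as a numeric token rather than a dictionary word.
            if any(ch.isdigit() for ch in part):
                tokens.append((part, "NUMERIC"))
                continue
            # Remove special characters that occur with the word.
            cleaned = re.sub(r"[^A-Za-z]", "", part)
            if cleaned:
                tokens.append((cleaned, None))
        return tokens

    # prefilter("10:30-baje!") -> [("10:30", "NUMERIC"), ("baje", None)]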

WORKFLOW

Four functions are defined:

  1. check_if_english(word)
  2. check_if_spanish(word)
  3. check_if_hindi(word)
  4. check_if_hindi_raw(word)

The difference between check_if_hindi(word) and check_if_hindi_raw(word) is the accuracy used with the Jaro similarity algorithm: higher accuracy is needed when checking whether a word is both English and Hindi. A rough sketch of these Hindi checks is given below.
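
As an illustration of the matching described in HOW IT WORKS, here is a sketch of how the two Hindi checks could combine difflib's close-match search with the Jaro similarity from jellyfish. The thresholds (0.9 and 0.8) and the function layout are assumptions for illustration only; recent jellyfish releases expose jaro_similarity, while older ones call it jaro_distance:

    import difflib
    import jellyfish

    # hindi_words is assumed to be the word list loaded from hindi_data.txt.
    def _check_hindi(word, hindi_words, threshold):
        # Gestalt pattern matching (difflib) narrows the search down to a few
        # candidate words from the data.
        candidates = difflib.get_close_matches(word, hindi_words, n=3)
        # Each candidate is then scored with the Jaro similarity; the word is
        # treated as Hindi if any candidate is similar enough.
        return any(jellyfish.jaro_similarity(word, c) >= threshold
                   for c in candidates)

    def check_if_hindi(word, hindi_words):
        # Stricter threshold, used when a word could be both English and Hindi.
        return _check_hindi(word, hindi_words, threshold=0.9)

    def check_if_hindi_raw(word, hindi_words):
        # Looser threshold for the plain Hindi check.
        return _check_hindi(word, hindi_words, threshold=0.8)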

The data files are read into lists at startup; the program adds or edits words where relevant while reading the files.
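
A sketch of that initialization, assuming each data file holds one word per line and that the English and Spanish checks are simple membership tests (the in-place additions and edits mentioned above are omitted):

    def load_words(path):
        # One word per line, lowercased so lookups are case-insensitive.
        with open(path, encoding="utf-8") as f:
            return [line.strip().lower() for line in f if line.strip()]

    english_words = load_words("english_data.txt")
    spanish_words = load_words("esp_data.txt")
    hindi_words = load_words("hindi_data.txt")

    def check_if_english(word):
        return word.lower() in english_words

    def check_if_spanish(word):
        return word.lower() in spanish_words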

The do(text) function is the main function of the program. It filters and modifies the raw text and appends the result to the worker list; the list is then iterated, each word is passed through the functions defined above, and the appropriate suffix is added.
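
A heavily simplified sketch of that flow, reusing the sketches above. The '/EN', '/ES', '/HI', and '/NUM' suffixes and the order of the checks are placeholders, since the actual suffix format is not spelled out in this README:

    def do(text):
        # Filter and modify the raw text, collecting tokens into the worker list.
        worker = []
        for raw in text.split():
            worker.extend(prefilter(raw))
        # Pass every token through the classifier functions and attach a suffix.
        result = []
        for word, tag in worker:
            if tag == "NUMERIC":
                result.append(word + "/NUM")
            elif check_if_english(word):
                result.append(word + "/EN")
            elif check_if_spanish(word):
                result.append(word + "/ES")
            elif check_if_hindi_raw(word, hindi_words):
                result.append(word + "/HI")
            else:
                result.append(word)
        return " ".join(result)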

The world_language_classification(inpp, outp) function parses the input JSON file, calls the do() function, and writes the output to the appropriate JSON file.
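
A sketch under the assumption that the input JSON maps keys to text strings; the actual schema of input.json is not documented here, so the layout below is illustrative only:

    import json

    def world_language_classification(inpp, outp):
        # Parse the input JSON file.
        with open(inpp, encoding="utf-8") as f:
            data = json.load(f)
        # Run do() on every entry (assumed {key: text} layout).
        output = {key: do(text) for key, text in data.items()}
        # Write the output to the appropriate JSON file.
        with open(outp, "w", encoding="utf-8") as f:
            json.dump(output, f, ensure_ascii=False, indent=2)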

CONTACT AND MISC

I have tried my best to keep this project as fast and self-contained as possible.

Open-source resources used:

  jellyfish module for Python: https://github.com/jamesturk/jellyfish
  deep-trans (not required for the program; used while organising the data): https://github.com/dashayushman/deep-trans

Language data taken from:

  http://www.gwicks.net/dictionaries.htm
  https://github.com/Shreeshrii/hindi-hunspell

Future vision: purchasing word-usage statistical data from sources like https://www.wordfrequency.info/ (for English and Spanish) and https://www.lexicalcomputing.com/ (for Hindi), and implementing the word search on data sorted by frequency, could make the classification considerably more accurate and fast. This would only need an upgrade and update once every 5-10 years.

For any queries or feedback, please email pjmathematician(at)gmail(dot)com.