ICD-10 Codes are codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. These codes are important in the treatment of people by accurately diagnosing their diseases and conditions.
The codes are located here: https://www.cms.gov/medicare/coordination-benefits-recovery/overview/icd-code-lists
One list of ICD-10 Codes is available here in only XLSX format: https://www.cms.gov/files/document/valid-icd-10-list.xlsx
The ICD-10 codes are a mess and the .xlsx
format isn't useful for anyone beyond those that live in an EXCEL compatible program.
The purpose of this repository are 2 fold:
- Creating processes to show how to convert this
.xlsx
file into other formats AND standardizing the data in such a way that it is more useful for developers and data scientists. - Showcasing different methods to perform these translations given a real-world example of having "messy" data.
The quickest way to convert the data from .xlsx
to .csv
is to:
- Download the
.xlsx
file locally - Import the file into Google Sheets
- Export from Google Sheets a
.csv
file
The data will contain the following issues that will need to be processed:
- Certain records will have a comma value within double-quotes that will cause problems since each comma is a delimiter
ABC123, "Testing, one", "Testing, two",
- Certain records will have a "[]" character pair surrounding a specific word
B0821,Exanthema subitum [sixth disease] due to human herpesvirus 6,Exanthema subitum [sixth disease] due to human herpesvirus 6,
- Certain records will have a "[]" character pair AND a
'
character:
A751,Recrudescent typhus [Brill's disease],Recrudescent typhus [Brill's disease],
- Certain records will have a comma separated value within double-quotes that needs to be replaced AND have "[]" characters
DEFG456, "[EXCLUDE] Testing, three", "[EXCLUDE] Testing, four",
- Certain records will return early and be broken up into 2 separate lines:
"H02881
",Meibomian gland dysfunction right upper eyelid,Meibomian gland dysfunction right upper eyelid,
Take into consideration all of the edge-cases listed above and make the sample data fit into the following columns:
CODE,SHORT DESCRIPTION (VALID ICD-10 FY2024),LONG DESCRIPTION (VALID ICD-10 FY2024),NF EXCL
Where NF EXCL
may or may not be present in the data and could be a blank column.
Example:
ABC123, "Testing, one", "Testing, two",
should be transformed into:
ABC123, "Testing - one", "Testing - two",
Assuming that the .csv
is named ICD-10-codes.csv
.
normalizer.sh
will take in the input .csv
file and an output .csv
file.
./normalizer.sh ICD-10-codes.csv ICD-10-codes-final.csv