Want to find out how many physical books you have in ExLibris Alma that you also have electronic versions of? If so then this is for you.
This application analyses the MARCXML files from ExLibris Alma to find the overlap of your physical and electronic records in your holdings based on ISBN and also by title.
It outputs files that can be imported into ExLibris Alma as sets.
In ExLibris Alma a single record may have multiple ISBNs, this application normalises the Alma data based on ISBN allowing for the matching of against all ISBN of phyiscal records against all ISBNs of electronic records.
Here is a small sample of records where ISBNs are the same
Alma ID (physical) | ISBN (physical) | ISBN (electronic) | Title (physcial) | Title (electronic) |
---|---|---|---|---|
9914634010001361 | 9780123820204 | 9780123820204 | database modeling and design | database modeling and design |
9914634010001361 | 123820200 | 123820200 | database modeling and design | database modeling and design |
9914632440001361 | 871706865 | 871706865 | titanium | titanium |
9914634110001361 | 9781597496537 | 9781597496537 | the basics of information security | the basics of information security |
9914633230001361 | 7144033 | 7144033 | manufacturing planning and control systems for supply chain management | manufacturing planning and control systems for supply chain management fifth edition |
9914633350001361 | 9781849691086 | 9781849691086 | iphone javascript cookbook | iphone javascript cookbook |
From this data we get just the unique Alma IDs for the matching physical records and export the results to a file that ExLibris Alma can import as a set.
Given that not all electronic records have ISBNs (or the correct ISBNs) you can also analyse for duplicates based on title. As many electronic record are catalogued differently, titles may be slightly different. We cater for this siutation by stripping all punctional, extra formatting and lower case the result then checking against only the first 50 characters. Also to narrow the results we are excluding records that already match on ISBN.
Here is a sample result
Alma ID (physical) | Alma ID (electronic) | Title (physial) | Title (electronic) |
---|---|---|---|
9914706280001361 | 99333124301361 | managing people | managing people |
9914727370001361 | 99333124301361 | managing people | managing people |
9914845200001361 | 99333439301361 | software engineering | software engineering |
9914758580001361 | 9914687570001361 | electric power systems | electric power systems |
9914836540001361 | 99333471201361 | color | color |
9914857210001361 | 99333471201361 | color | color |
9914852230001361 | 99333515301361 | writing winning business proposals | writing winning business proposals |
9914758580001361 | 9914689900001361 | electric power systems | electric power systems |
9914833850001361 | 9914988860001361 | as time goes by | as time goes by |
9914834330001361 | 9915002000001361 | software architecture | software architecture |
Again these results can be exported into a format the ExLibrs Alma can import as a set.
ISBN matches are very likely to be exactly the same record in a different format.
Titles matches are far less likely to be exactly the same item in a different format - as many titles are generically worded or named the same - which leads to non-accurate matches.
Assume that only 20% or less of the title matches are actually the same
Note: Titles matching excludes records where their ISBNs match - the title matching records complement the ISBN matching results.
To analyse the ExLibris Alma data we first need to get full outputs files from Alma of all the physical and electronic records.
To do this create an ALma set for physical books
Create an Alma set for electronic books
Then for each set, run a job to export the records
Select MARC21 XML as the output format, select 10,000 as the number of records in a file and select your FTP server
Once you have all the MARCXML files for both the electronic and print records save them into the folders on your server/workstation where you'll run this script
Save the electronic records into a folder called 'electronic' and physical records into a folder called 'print'
Here is a screenshot of how i've stored the MARCXML files
On your MySQL server create a new blank database and import the database_setup.sql file
- or just copy and paste the contents of the file into your MySQL query console to execute
Edit the import_xml_to_mysql.sql file and update it with your MySQL database connection information
## MySQL database connection details
my $database_name = "alma_title_match";
my $database_host = "localhost";
my $database_user = "root";
my $database_password = "";
my $database_port = "3306";
This import script is written in Perl, please ensure you have the below Perl modules installed on your system
- XML::Simple
- Data::Dumper
- DBI
The script runs best on a Linux type system, it just needs Perl, the above Perl modules, and MySQL client software installed to run.
Once you have edited the import file to add int he MySQL connection details run the import script over the MARCXML files to convert MARCXML to MySQL
In the directory where you have the import script and MARCXML files to import the phyiscal records run
Note: the first argument 'physical' records into the Database the type of iles the second argument 'print' define the directory that stores the MARCXML files for the physical/print records.
perl import_xml_to_mysql.pl 'physical' 'print' > outpul_physical.txt
This will take a number of hours, you can tail the output_physical.txt file to monitor the progress
To import the electronic records run
perl import_xml_to_mysql.pl 'electronic' 'electronic' > output_electronic.txt
Note: the first argument 'electronic' records into the Database the type of files the second argument 'electronic' define the directory that stores the MARCXML files for the electronic records.
This will take a number of hours, you can tail the output_electronic.txt file to monitor the progress
These 2 script convert all the MARCXML files in the specified directories into your MySQL database.
Both imports can take a number of hours based on the number of records and type of server.
Once done we can analyse the data use SQL.
This is a query to get a sample of your results with titles, Alma ID and ISBNs so can you review the results to ensure that they are accurate
select
distinct
overlap_physical.alma_id,
overlap_physical.isbn,
overlap_electronic.isbn,
overlap_physical.title,
overlap_electronic.title
from
overlap_physical
INNER JOIN
overlap_electronic ON overlap_physical.isbn = overlap_electronic.isbn
limit 20;
Heres is a sample result
Alma ID (physical) | ISBN (physical) | ISBN (electronic) | Title (physcial) | Title (electronic) |
---|---|---|---|---|
9914634010001361 | 9780123820204 | 9780123820204 | database modeling and design | database modeling and design |
9914634010001361 | 123820200 | 123820200 | database modeling and design | database modeling and design |
9914632440001361 | 871706865 | 871706865 | titanium | titanium |
Once you're happy with the results run this query to output a list of unique Alma IDs of physical records that also have electronic copies
select
distinct
overlap_physical.alma_id
from
overlap_physical
INNER JOIN
overlap_electronic ON overlap_physical.isbn = overlap_electronic.isbn
INTO OUTFILE '/tmp/alma_isbn.csv' FIELDS TERMINATED BY ',';
Here is a sample result , note that only unique Alma IDs are selected - there are no duplicates - unlike the previos tseting query
Alma ID (physical) |
---|
9914634010001361 |
9914632440001361 |
This CSV file can be saved in Excel as .xls or .xlsx and import into Alma as a set for further analyses.
Run this query to save the matching ISBNs into a separate table - this is to speed up the title matching queries in the next section
insert
into
overlap_overlap (alma_id)
select
DISTINCT
overlap_physical.alma_id
from
overlap_physical
INNER JOIN
overlap_electronic ON overlap_physical.isbn = overlap_electronic.isbn ;
To match electronic and physical title by not - where they have not already been matched on ISBN in the previous query - run the below query.
This output a csv file to /tmp called alma_titles.csv
SELECT
DISTINCT
overlap_physical.alma_id as p_alma_id,
overlap_electronic.alma_id as e_alma_id,
overlap_physical.title as p_title,
overlap_electronic.title as e_title
from
overlap_physical,
overlap_electronic
where
LEFT(overlap_physical.title,50) = LEFT(overlap_electronic.title,50)
AND
overlap_physical.isbn != overlap_electronic.isbn
AND
overlap_physical.alma_id NOT IN (select overlap_overlap.alma_id from overlap_overlap)
INTO OUTFILE '/tmp/alma_titles.csv' FIELDS TERMINATED BY ',';
Here is a sample result
Alma ID (physical) | Alma ID (electronic) | Title (physial) | Title (electronic) |
---|---|---|---|
9914706280001361 | 99333124301361 | managing people | managing people |
9914727370001361 | 99333124301361 | managing people | managing people |
9914845200001361 | 99333439301361 | software engineering | software engineering |
Again this CSV file can be saved in Excel as .xls or .xlsx and import into Alma as a set for further analyses.