ICUのシラバスとコースオファリングスのスクレイピングを行ってmariaDBに突っ込む感じのサムシング。ついでに日本語全文検索のためにMroongaもインストールするよ。
- selenium
- webdriver-manager
- beautifulsoup4
- lxml
- mariadb (will need mariadb conector c? Install CS package)
- tqdm (進捗確認用、必要なければ
scrape.py
のl:84for i in tqdm(range(len(regno))):
からtqdmを外す) - Flask
- falcon
かな?エラー出たら適時インストールしてください…
pip3 install selenium webdriver-manager beautifulsoup4 lxml mariadb tqdm Flask falcon
For mariaDB, this repo uses mariadb along with mroonga for Japanese full text search. Setup as follows. IF USING UBUNTU, CHECK THAT UBUNTU IS 20.04. MROONGA DOES NOT SUPPORT 22.04
Install mariadb and run mysql_secure_installation
script.
sudo apt install mariadb-server
sudo mysql_secure_installation
Then, setup users and default database.sudo mariadb
CREATE USER username@'%' IDENTIFIED BY 'passwordForUser';
CREATE DATABASE databaseName;
GRANT ALL ON databaseName.* TO username@'%';
FLUSH PRIVILEGES;
QUIT;
This will create a user that can access databaseName from anywhere, but mariadb will still refuse connections from outside local host. Therefore, go to mariadb config file. In my case it was /etc/mysql/mariadb.conf.d/50-server.cnf
. Open it with sudo nano(or Vim if you want to look cool or something). Under [mysqld]
find bind-address=127.0.0.1
. comment it out.
Next, this isn't nessesary for mariadb installation, as we're in the config file, we can add config for mroonga. Under [mysqld]
, add this to the end.
innodb_ft_min_token_size=1
ft_min_word_len=1
innodb_buffer_pool_size=1024M
innodb_log_file_size=1G
server-id=100
max_connect_errors=10000
max-connections=500
character-set-server=utf8mb4
Again, before installing mroonga engine, make shure that the os is a version supported by mroonga. Supported OS versions can be seen here. The setup in hte next section is for Ubuntu 20.04.
Mroonga should be bundled with maradb but I've never seen it. So, to install, first enable universe and security repo in ubuntu.
sudo apt-get install -y -V software-properties-common lsb-release
sudo add-apt-repository -y universe
sudo add-apt-repository "deb http://security.ubuntu.com/ubuntu $(lsb_release --short --codename)-security main restricted"
Then, add PPA to repo
sudo add-apt-repository -y ppa:groonga/ppa
sudo apt-get update
Finally install mroonga for mariadb aswell as MeCab for Japanese Tokenize
sudo apt-get install -y -V mariadb-server-mroonga
sudo apt-get install -y -V groonga-tokenizer-mecab
That's all! congrats you now should have mroonga installed on mariadb.
sudo mariadb
MariaDB [(none)]> SHOW ENGINES;
+--------+---------+----------------------------------------+
| Engine | Support | Comment |
+--------+---------+----------------------------------------+
| Mroonga| YES | CJK-ready fulltext search, column store|
| InnoDB | DEFAULT | Supports transactions, row-level loc...|
+--------+---------+----------------------------------------+
9 rows in set (0.000 sec)
To enable access from local servers,
sudo ufw allow to 192.168.0.0/16 port 3306 proto tcp
where 192.168.0.0/16
selects private network range and port 3306
is mariadb port with proto tcp
being the protocol used. This should limit the firewall so that only local devices can access the server.
sudo systemctl restart mariadb
sudo systemctl restrt ufw
for the changes to take place
mecabrc will be missing. this folder likely will be in /etc/mecabrc
for some reason. Thus, copy the folder to what mecab wants sudo cp /etc/mecabrc /usr/local/etc
.
noteable: https://engineering.linecorp.com/ja/blog/mecab-ipadic-neologd-new-words-and-expressions/ https://qiita.com/katsuyuki/items/65f79d44f5e9a0397d31 https://32imuf.com/sqlalchemy/note/ -> add full text index to TEXT is possible?
install dict using sudo apt install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file
for better dictionaries, install ipadic-neologd
git clone https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo bin/install-mecab-ipadic-neologd
to find dict location, sudo find / | grep mecab-ipadic-neologd
Setup example
engine = sa.create_engine("mariadb+mariadbconnector://python:pythonAccess56@172.31.54.136:3306/syllabusdb?charset=utf8mb4",echo=True)
Base = declarative_base()
class Course(Base):
__tablename__ ="testtable"
__table_args__ = {
'mariadb_ENGINE': 'mroonga',
'mariadb_DEFAULT_CHARSET': 'utf8mb4'
}
regno = sa.Column(sa.Integer, primary_key=True)
title_j = sa.Column(sa.String(length=15900))
title_e = sa.Column(sa.String(length=400))
To access the database in a VPC using python and mariadb, the connection setup should be something like this:
conn = mariadb.connect(
user = "username",
password = "passwordForUser",
host = "local-ip-address",
port = 3306,
database = "databaseName"
)
where the username and password is from the setup done in the mariadb setup section. The host woll point to localhost or another server. In this case we are using AWS EC2 so we will use the local ip, which shouldn't change even after reboot.
mariadbTemp folder includes files for scraping data from icu and pushing it to a different ec2 instance. Should be
- Scrape.py : source of scraping
- helper.py : helper function to pharse the results from scrape.py currently used for course info
- sqlAlch.py : use MeCab to turn scraped content to tags and put them in remote mariadb
sudo apt install curl gnupg2 ca-certificates lsb-release ubuntu-keyring
curl https://nginx.org/keys/nginx_signing.key | gpg --dearmor \
| sudo tee /usr/share/keyrings/nginx-archive-keyring.gpg >/dev/null
check that key is proper
gpg --dry-run --quiet --import --import-options import-show /usr/share/keyrings/nginx-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] \
http://nginx.org/packages/ubuntu `lsb_release -cs` nginx" \
| sudo tee /etc/apt/sources.list.d/nginx.list
echo -e "Package: *\nPin: origin nginx.org\nPin: release o=nginx\nPin-Priority: 900\n" \
| sudo tee /etc/apt/preferences.d/99nginx
sudo apt update
sudo apt install nginx
to test
gunicorn -b 0.0.0.0:8080 app:application --reload --workers=6
and ///
http://icu-syllabus.com/search.html http://icu-syllabus.com:8080/api/v1/getData?search=