poanetwork/blockscout-terraform

Issue while creating infrastructure

I'm not sure how I got into this state, but I seem to be stuck.

The first time I ran ansible-playbook deploy_infra.yml I got:

fatal: [devnet]: FAILED! => cmd: "echo yes | /usr/local/bin/terraform apply terraform.tfplan", rc: 1, msg: "non-zero return code", start: 2019-10-25 11:36:20, end: 2019-10-25 11:38:20, stderr:

Error: ApplicationAlreadyExistsException: lgr-explorer already exists.
	status code: 400, request id: 0f7b3ea0-d04b-4ebc-9dc9-78f15c7ded33

  on deploy.tf line 11, in resource "aws_codedeploy_app" "explorer":
  11: resource "aws_codedeploy_app" "explorer" {

Error: error creating SSM parameter: ParameterAlreadyExists: The parameter already exists. To overwrite this value, set the overwrite option in the request to true.
	status code: 400, request id: e33c8863-a4b4-4e48-9f32-235ae37a6384

...etc...

I'm not super familiar with Terraform and Ansible, but I know one of their main ideas is that they let you be more declarative and less imperative, so I figured running it again might fix it. I then got:

fatal: [devnet]: FAILED! => cmd: "echo yes | /usr/local/bin/terraform apply terraform.tfplan", rc: 1, msg: "non-zero return code", start: 2019-10-25 11:41:31, end: 2019-10-25 11:41:40, stderr:

Error: Saved plan is stale

The given plan file can no longer be applied because the state was changed by
another operation after the plan was created.

stdout:

Acquiring state lock. This may take a few moments...
Releasing state lock. This may take a few moments...

Ok...nothing's deployed yet, what if I just completely destroy the infra and deploy it clean? I ran ansible-playbook destroy.yml and got:

Are you sure you want to destroy all the infra? [False]: 

PLAY [Destroy infrastructure] *************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************
ok: [devnet]

PLAY RECAP ********************************************************************************************************************************
devnet                     : ok=1    changed=0    unreachable=0    failed=0    skipped=23   rescued=0    ignored=0

Cool, so that seems clean then...? What if I run ansible-playbook deploy_infra.yml again?

fatal: [devnet]: FAILED! => cmd: "echo yes | /usr/local/bin/terraform apply terraform.tfplan", rc: 1, msg: "non-zero return code", start: 2019-10-25 11:46:56, end: 2019-10-25 11:47:05, stderr:

Error: Saved plan is stale

The given plan file can no longer be applied because the state was changed by
another operation after the plan was created.

stdout:

Acquiring state lock. This may take a few moments...
Releasing state lock. This may take a few moments...

Ok, I seem to be stuck. How do I get rid of the 'saved plan' or clean up all the things it's done so I can start fresh?

@thekevinbrown

What does your hosts file look like? And what are the names and contents of the files you created in the /host_vars and /group_vars folders (do not forget to hide any sensitive content)?

hosts

[lgr]
devnet

group_vars/all.yml

# Infrastructure related group variables

## Exact path to the TF binary on your local machine
terraform_location: "/usr/local/bin/terraform"

## Name of the DynamoDB table where current lease of TF state file will be stored
dynamodb_table: "poa-terraform-lock"

## If ec2_ssh_key_content is empty, all the virtual machines will be created with the ec2_ssh_key_name key. Otherwise, the playbooks will upload ec2_ssh_key_content under the name ec2_ssh_key_name and launch the virtual machines with that key
ec2_ssh_key_name: "<my ssh key name>"
ec2_ssh_key_content: ""

## The VPC containing Blockscout resources will be created as follows:
vpc_cidr: "10.0.0.0/16"
public_subnet_cidr: "10.0.0.0/24"
# This variable should be interpreted as follows:
# Variable: 10.0.1.0/16
# Real networks: 10.0.1+{{ number of chain starting with 0 }}.0/24
db_subnet_cidr: "10.0.1.0/16"

## The internal DNS zone will look like:
dns_zone_name: "poa.internal"

## Size of the EC2 instance EBS root volume
root_block_size: 120

# System variables
ansible_python_interpreter: "/usr/bin/python3"

# Common variables

## Credentials to connect to AWS. Either keypair or CLI profile name should be specified. If nothing is specified, the default AWS keypair is used. Region must be specified in all the cases.
#aws_access_key: ""
#aws_secret_key: ""
#aws_profile: ""
aws_region: "us-east-1"

## If set to true, the backend will be uploaded to and stored in an S3 bucket, so you can easily manage your deployment from any machine. It is highly recommended not to change this variable
backend: true
## If this is set to true along with the backend variable, this config file/the log output will be saved to the S3 bucket. Please make sure to name the config file "all.yml". Otherwise, no upload will be performed
upload_config_to_s3: true
upload_debug_info_to_s3: true

## The bucket and dynamodb_table variables will be used only when backend variable is set to true
## Name of the bucket where TF state files will be stored
bucket: "poa-terraform-state"

group_vars/lgr.yml

# Infrastructure related group variables

## Exact path to the TF binary on your local machine
terraform_location: "/usr/local/bin/terraform"

## Name of the DynamoDB table where current lease of TF state file will be stored
dynamodb_table: "legaler-blockscout-terraform-lock"

## If ec2_ssh_key_content is empty, all the virtual machines will be created with the ec2_ssh_key_name key. Otherwise, the playbooks will upload ec2_ssh_key_content under the name ec2_ssh_key_name and launch the virtual machines with that key
ec2_ssh_key_name: "<my ssh key name>"
ec2_ssh_key_content: ""

## The VPC containing Blockscout resources will be created as follows:
vpc_cidr: "10.0.0.0/16"
public_subnet_cidr: "10.0.0.0/24"
# This variable should be interpreted as follows:
# Variable: 10.0.1.0/16
# Real networks: 10.0.1+{{ number of chain starting with 0 }}.0/24
db_subnet_cidr: "10.0.1.0/16"

## The internal DNS zone will look like:
dns_zone_name: "legaler.internal"

## Size of the EC2 instance EBS root volume
root_block_size: 120

# System variables
ansible_python_interpreter: "/usr/local/bin/python"

# Common variables

## Credentials to connect to AWS. Either keypair or CLI profile name should be specified. If nothing is specified, the default AWS keypair is used. Region must be specified in all the cases.
#aws_access_key: ""
#aws_secret_key: ""
aws_profile: "legaler"
aws_region: "us-east-1"

## If set to true, the backend will be uploaded to and stored in an S3 bucket, so you can easily manage your deployment from any machine. It is highly recommended not to change this variable
backend: true
## If this is set to true along with the backend variable, this config file/the log output will be saved to the S3 bucket. Please make sure to name the config file "all.yml". Otherwise, no upload will be performed
upload_config_to_s3: true
upload_debug_info_to_s3: true

## The bucket and dynamodb_table variables will be used only when backend variable is set to true
## Name of the bucket where TF state files will be stored
bucket: "legaler-blockscout-terraform-state"

host_vars/devnet.yml

terraform_location: "/usr/local/bin/terraform"

db_id: "core" # This value represents the name of the DB that will be created/attached. Must be unique. Will be prefixed with `prefix` variable.
db_name: "core" # Each network should have it's own DB. This variable maps chain to DB name. Should not be messed with db_id variable, which represents the RDS instance ID.

## The following variables describe the DB configuration for each network, including usernames, passwords, instance class, etc.
db_username: "core"
db_password: "<a password here>"
db_instance_class: "db.t3.medium"
db_storage: "100" # in GiB
db_storage_type: "gp2" # see https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html for details
#db_iops: "1000" # This should be set only if chain_db_storage is set to `io1`
db_version: "10.6" #Blockscout uses Postgres as the DB engine. This variable describes the Postgres version used in each particular chain.

instance_type: "m5.large" # EC2 BlockScout Instance will have this type
use_placement_group: false # Choose whether or not to group BlockScout instances into a placement group

ansible_host: localhost # The address of the machine where BlockScout staging will be built
ansible_connection: local # Comment out if your ansible_host is not localhost

chain: poa # Does not have to be unique. Represents the chain name.

env_vars:
  NETWORK: "(Legaler)" # Name of the organization/community that hosts the chain
  SUBNETWORK: "DevNet" # Actual name of the particular network
  NETWORK_ICON: "_test_network_icon.html" # Either _test_network_icon.html or _network_icon.html, depending on the type of the network (prod/test).
  LOGO: "/images/blockscout_logo.svg" # Chain logo
  ETHEREUM_JSONRPC_VARIANT: "parity" # Chain client installed at ETHEREUM_JSONRPC_HTTP_URL
  ETHEREUM_JSONRPC_HTTP_URL: "<RPC URL>" # Network RPC endpoint
  ETHEREUM_JSONRPC_TRACE_URL: "<RPC URL>" # Network RPC endpoint in trace mode. Can be the same as the previous variable
  ETHEREUM_JSONRPC_WS_URL: "<Websockets RPC URL>" # Network RPC endpoint in websocket mode
  NETWORK_PATH: "/legaler/devnet" # relative URL path, for example: blockscout.com/$NETWORK_PATH
  SECRET_KEY_BASE: "<A secret key I generated with the openssl command on this line>" # Secret key for production assets protection. Use `mix phx.gen.secret` or `openssl rand -base64 64 | tr -d '\n'` to generate
  PORT: 4000 # Port the application runs on
  COIN: "LGR" # Coin name at the Coinmarketcap, used to display current exchange rate
  POOL_SIZE: 20 # Defines the number of database connections allowed
  ECTO_USE_SSL: "false" # Specifies whether or not to use SSL on Ecto queries
  #ALB_SSL_POLICY: "ELBSecurityPolicy-2016-08" #SSL policy for Load Balancer. Required if ECTO_USE_SSL is set to true
  #ALB_CERTIFICATE_ARN: "arn:aws:acm:us-east-1:290379793816:certificate/6d1bab74-fb46-4244-aab2-832bf519ab24" #ARN of the certificate to attach to the LB. Required if ECTO_USE_SSL is set to true
  HEART_BEAT_TIMEOUT: 30 # Heartbeat is an Erlang monitoring service that will restart BlockScout if it becomes unresponsive. This variable configures the timeout before BlockScout will be restarted.
  HEART_COMMAND: "sudo systemctl restart explorer.service" # This variable represents a command that is used to restart the service
  BLOCKSCOUT_VERSION: "v2.0.0-beta" # Added to the footer to signify the current BlockScout version
  ELIXIR_VERSION: "v1.8.1" # Elixir version to install on the node before Blockscout deploy
  BLOCK_TRANSFORMER: "base" # Transformer for blocks: base or clique.
  #GRAPHIQL_TRANSACTION: "0xbc426b4792c48d8ca31ec9786e403866e14e7f3e4d39c7f2852e518fae529ab4" # Random tx hash on the network, used as default for graphiql tx.
  TXS_COUNT_CACHE_PERIOD: 7200 # Interval in seconds to restart the task, which calculates the total txs count.
  ADDRESS_WITH_BALANCES_UPDATE_INTERVAL: 1800 #Interval in seconds to restart the task, which calculates addresses with balances
  LINK_TO_OTHER_EXPLORERS: "false" # If true, links to other explorers are added in the footer
  USE_PLACEMENT_GROUP: "false" # If true, BlockScout instance will be created in the placement group
  ##The following variables are optional
  ## The SUPPORTED_CHAINS variable should have a space before the main content. This is due to an Ansible variable interpretation bug
  #SUPPORTED_CHAINS: ' [{ "title": "Legaler DevNet", "url": "https://bloclegaler.com/poa/core" }]' # JSON array with links to other explorers
  FIRST_BLOCK: 0 # The block number, where indexing begins from.
  #COINMARKETCAP_PAGES: 10 # Sets the number of pages at Coinmarketcap to search coin at. Defaults to 10
  #METADATA_CONTRACT: # Address of metadata smart contract. Used by POA Network to obtain Validators information to display in the UI
  #VALIDATORS_CONTRACT: #Address of the Emission Fund smart contract
  #SUPPLY_MODULE: "false" # Used by the xDai Chain to calculate the total supply of the chain
  #SOURCE_MODULE: "false" # Used to calculate the total supply
  #DATABASE_URL: # Database URL. Usually generated automatically, but this variable can be used to modify the URL of the databases during the updates.
  #CHECK_ORIGIN: "false" # Used to check the origin of requests when the origin header is present
  #DATADOG_HOST: # Host configuration variable for Datadog integration
  #DATADOG_PORT: # Port configuration variable for Datadog integration
  #SPANDEX_BATCH_SIZE: # Spandex and Datadog configuration setting.
  #SPANDEX_SYNC_THRESHOLD: # Spandex and Datadog configuration setting.
  #BLOCK_COUNT_CACHE_PERIOD: 600 #Time to live of block count cache in milliseconds
  #ALLOWED_EVM_VERSIONS: "homestead, tangerineWhistle, spuriousDragon, byzantium, constantinople, petersburg" # The comma-separated list of allowed EVM versions for contract verification
  #BUILD_* - redefine variables with BUILD_ prefix to override parameters used for building the dev server

I intend to tweak these values but wanted to get it up first.

@thekevinbrown I checked the supplied configuration files. Nothing looks wrong with them from my perspective. Could you please confirm the following:

  • you use Terraform >= 0.12
  • /usr/local/bin/terraform exists
  • /usr/local/bin/python exists
  • you have all updates from the remote repo in your local master branch

If so, it looks like you've already created the infrastructure, fully or partially, since the error says an application with that name already exists in CodeDeploy (https://console.aws.amazon.com/codesuite/codedeploy/applications). If the infrastructure was not fully created, I would suggest destroying the provisioned infrastructure (https://docs.blockscout.com/for-developers/ansible-deployment/destroying-provisioned-infrastructure) and then deploying it again with ansible-playbook deploy_infra.yml. If the error repeats, could you please provide the full deployment log?
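For reference, the cycle I have in mind is roughly this (a sketch, run from the repository root with your configs in place):

$ ansible-playbook destroy.yml      # confirm the destroy at the prompt (it defaults to False)
$ ansible-playbook deploy_infra.yml # then re-create the infrastructure from a clean slate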

$ terraform --version
Terraform v0.12.12
$ ls /usr/local/bin/terraform 
/usr/local/bin/terraform
$ ls /usr/local/bin/python
/usr/local/bin/python
$  git status
On branch master
Your branch is behind 'origin/master' by 2 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)

nothing to commit, working tree clean

I'll pull down those latest two commits, then if that doesn't fix it I'll try destroying and recreating, thanks.

The latest two commits were updates to the Readme, so I doubt they would make much difference.

  • Running ansible-playbook destroy.yml definitely did more than the last time I ran it.
  • After a few minutes of it running, Python crashed:
objc[13279]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[13279]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

I then googled and found this. So I did export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before running destroy, which then gave me this:

TASK [destroy : Fetch environment variables (via profile)] **************************************************************************
fatal: [devnet]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'aws_ssm'. Error was a <class 'botocore.exceptions.NoRegionError'>, original message: You must specify a region."}

But the region is specified in group_vars/lgr.yml as you can see above.

So this leads me to two questions:

  • I'm really not convinced of the stability of these tools for keeping the infra running. Is there any way we can get a list of the infra that needs to be created and configured so we can deploy it manually?
  • How can I clean up what ansible did without using it?

Thanks again for your help.

@thekevinbrown

Try disabling the aws profile in the config and leaving only aws_region there. The credentials will then be retrieved from your machine:

## Credentials to connect to AWS. Either keypair or CLI profile name should be specified. If nothing is specified, the default AWS keypair is used. Region must be specified in all the cases.
#aws_access_key: ""
#aws_secret_key: ""
#aws_profile: "legaler"
aws_region: "us-east-1"
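With aws_profile commented out, the credentials are picked up from your machine's standard AWS configuration, i.e. the [default] entry in ~/.aws/credentials or the usual environment variables. A minimal sketch (substitute your own keys):

$ export AWS_ACCESS_KEY_ID=<your access key id>
$ export AWS_SECRET_ACCESS_KEY=<your secret access key>
$ ansible-playbook deploy_infra.yml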

Regarding the Python version, I'd recommend using python3. In that case, you need to change the ansible_python_interpreter path in both configs. I also recently had problems with the last deployment through Terraform using python@2 on macOS Catalina. After reinstalling Ansible with python3, it worked fine for me. This is my ansible version:

MacBook-Pro-Viktor:BS keys viktor$ ansible --version
ansible 2.8.5
  config file = None
  configured module search path = ['/Users/viktor/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/Cellar/ansible/2.8.5_1/libexec/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.4 (default, Oct 12 2019, 19:06:48) [Clang 11.0.0 (clang-1100.0.33.8)]
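A minimal sketch of the switch, assuming Homebrew on macOS (your interpreter path may differ, so check it first):

$ brew install ansible    # Homebrew's ansible package runs on python3
$ which python3
/usr/local/bin/python3
$ # then point ansible_python_interpreter in both group_vars configs at that path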

I'm really not convinced of the stability of these tools for keeping the infra running. Is there any way we can get a list of the infra that needs to be created and configured so we can deploy it manually?

Basically, this is the main deployment tool we use to deploy new chains and release updates at blockscout.com. Once properly configured, it should work fine. Yes, if you want to deploy it all manually from scratch on AWS, it takes a huge effort. The list of AWS services used to build the infra is here: https://docs.blockscout.com/for-developers/ansible-deployment/aws-permissions, but it should be mentioned that there is no step-by-step instruction on how to deploy the whole infra manually.

How can I clean up what ansible did without using it?

Take a look here. Here are the manual cleaning instructions: https://forum.poa.network/t/aws-settings-for-blockscout-terraform-deployment/1962#cleaning

Hi @vbaranov,

I've now changed to using python3 and left just aws_region in there. Here's the output.

fatal: [devnet]: FAILED! => msg:

The task includes an option with an undefined variable. The error was: 'region' is undefined

The error appears to be in '/Users/kevin/development/blockscout-terraform/roles/destroy/tasks/parameter_store.yml': line 1, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: Fetch environment variables (via access key)
  ^ here

Hi @thekevinbrown

Did you set up the AWS region by means of aws configure? What is the output of cat ~/.aws/config?
In my case, it is:

MacBook-Pro-Viktor:~ viktor$ cat ~/.aws/config
[default]
region = us-east-1
output = json
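If the region were missing there, it can be set non-interactively with the AWS CLI, or just for a single run via an environment variable (a sketch; use whichever region you actually want):

$ aws configure set region us-east-1                          # writes the region into ~/.aws/config
$ AWS_DEFAULT_REGION=us-east-1 ansible-playbook destroy.yml   # or override it only for this run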

Also, I suggest you keep aws_region only in the config file under the group_vars folder and remove any aws_... parameters from the file under the host_vars folder.

Anyways, what is the output of ansible --version in your case now?

I don't remember how I set up my AWS config as that was at least 2 years ago. Here's ~/.aws/config:

[default]
region = ap-southeast-2

We're based in Sydney so that's a good default for the rest of our AWS work, but I want this deployed in US East 1, hence the settings in the group vars files.

$ ansible --version
ansible 2.9.1
  config file = /Users/kevin/development/blockscout-terraform/ansible.cfg
  configured module search path = ['/Users/kevin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]

As for only having the aws config in group_vars and not in host_vars, I'm not sure what you mean. There are no aws_... values in the host_vars file I posted above, no? What am I missing?

@thekevinbrown It looks like your deployment issue is due to the version of Terraform used. The current code from the master branch is, indeed, not stable with Terraform 0.12.x. I've updated our docs with this warning:

Deployment with Terraform 0.12 is unstable due to these bugs: #144, #147, #148, #149. Please use TF 0.11.11 - 0.11.14 and the following branch for deployment: https://github.com/poanetwork/blockscout-terraform/tree/before-t12
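A minimal sketch of switching over, assuming tfenv is used to manage Terraform versions (any other way of installing a 0.11.x binary works too):

$ git fetch origin
$ git checkout before-t12
$ tfenv install 0.11.14 && tfenv use 0.11.14
$ terraform --version
Terraform v0.11.14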

In order to clean up a previously failed deployment, you need to delete the corresponding assets in the following AWS resources:

  • EC2
    • Autoscaling group
    • Launch configuration
    • Load Balancer
    • Target group
    • Network interface
  • RDS instance
  • VPC
  • DynamoDB tables
  • IAM roles
  • CodeDeploy application
  • S3 buckets
  • Route 53
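
Some of these can also be removed with the AWS CLI instead of the console, for example (a sketch; the names below come from the configs earlier in this thread, so substitute your own resource names and region):

$ aws deploy delete-application --application-name lgr-explorer --region us-east-1
$ aws dynamodb delete-table --table-name legaler-blockscout-terraform-lock --region us-east-1
$ aws s3 rb s3://legaler-blockscout-terraform-state --force --region us-east-1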