Commit e5a1102

committed
WIP2...
1 parent 6b70281 commit e5a1102

5 files changed

+81
-35
lines changed

.env

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,2 +1,2 @@
1-
UTA_DATABASE_SCHEMA=uta_20210129b
1+
UTA_DATABASE_SCHEMA=uta_20240523b
22
UTILITIES_DATA_VERSION=113c119
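
As a rough illustration of the `KEY=VALUE` format used by the `.env` file above (and by the `secrets.env` file that the README change describes), here is a minimal Python sketch of reading such a file. The `parse_env_text` helper is hypothetical and is not part of this commit; how the repository actually loads these files is not shown in this diff.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper).

    Blank lines, comment lines and lines without '=' are skipped.
    Quoting and variable expansion are deliberately not handled.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# The two entries of the .env file after this commit.
env_text = "UTA_DATABASE_SCHEMA=uta_20240523b\nUTILITIES_DATA_VERSION=113c119\n"
parsed = parse_env_text(env_text)
print(parsed["UTA_DATABASE_SCHEMA"])
```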

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
44
.pytest_cache
55
__pycache__
66
.venv
7+
/seqrepo
78
/refseq
89
/data
910
/tmp

README.md

Lines changed: 75 additions & 28 deletions
@@ -42,8 +42,17 @@ The operations return the following status codes:
4242

4343
## Testing
4444

45-
To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code Testing functionality which should discover them automatically. You can also
46-
run `python3 -m pytest` from the terminal to execute them all.
45+
For local development, you will have to create a `secrets.env` file in the root of the repo and add in it the MongoDB
46+
password and, optionally, the UTA Postgres database connection string (see the UTA section below for details):
47+
48+
```
49+
MONGODB_READONLY_PASSWORD=...
50+
UTA_DATABASE_URL=...
51+
```
52+
53+
To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
54+
Testing functionality which should discover them automatically. You can also run `python3 -m pytest` from the terminal
55+
to execute them all.
4756

4857
Additionally, since the tests run against the Mongo DB database, if you need to update the test data in this repo, you
4958
can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
@@ -80,25 +89,60 @@ normalisation requires access to a copy of the [UTA](https://github.com/biocommo
8089
We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
8190
dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).
8291

83-
The connection string for this database can be found in Heroku under the `UTA_DATABASE_URL` environment variable.
92+
We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
93+
currently imported database schema.
8494

85-
Additionally, we define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name
86-
of the currently imported database schema.
95+
#### Database import procedure (it will take about 30 minutes):
8796

88-
Database import procedure (it will take about 10 minutes):
97+
- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
98+
- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
99+
open a new window)
100+
- Go to the Settings tab
101+
- Click "View Credentials"
102+
- Use the fields from this window to fill in the variables below
89103

90104
```shell
91-
> UTA_SCHEMA="uta_20210129b" # Specify the UTA schema you wish to use
92-
> PGPASSWORD="${POSTGRES_PASSWORD}"
93-
> gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v anonymous | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
105+
$ POSTGRES_HOST="<Heroku Postgres Host>"
106+
$ POSTGRES_DATABASE="<Heroku Postgres Database>"
107+
$ POSTGRES_USER="<Heroku Postgres User>"
108+
$ PGPASSWORD="<Heroku Postgres Password>"
109+
$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
110+
$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
94111
```
95112

96-
Note: `grep -v anonymous` is required because it's not possible to create an `anonymous` role in Heroku Postgres.
113+
Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.
114+
115+
Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
116+
it.
117+
118+
#### Connection string
97119

98-
Once the process finishes, if you are using the Heroku Postgres Basic plan on the
99-
[Essential Tier](https://devcenter.heroku.com/articles/heroku-postgres-plans#essential-tier), you'll bump into the 10
100-
million rows / database limit. However, it's safe to ignore the warnings about this limit, since Heroku will simply
101-
revoke INSERT privileges from the database and the hgvs library only needs read-only access to this database.
120+
The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
121+
It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
122+
same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.
123+
124+
For local development, if you'd like to use this Postgres instance instead of the HGVS public one
125+
(`postgresql://anonymous:anonymous@uta.biocommons.org/uta`), please add `UTA_DATABASE_URL` with the Heroku Postgres
126+
connection string in the `secrets.env` file.
127+
128+
#### Testing the database
129+
130+
```shell
131+
$ pgcli "${UTA_DATABASE_URL}"
132+
> set schema '<UTA Schema>'; # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
133+
> select count(*) from alembic_version
134+
union select count(*) from associated_accessions
135+
union select count(*) from exon
136+
union select count(*) from exon_aln
137+
union select count(*) from exon_set
138+
union select count(*) from gene
139+
union select count(*) from meta
140+
union select count(*) from origin
141+
union select count(*) from seq
142+
union select count(*) from seq_anno
143+
union select count(*) from transcript
144+
union select count(*) from translation_exception;
145+
```
102146
103147
### RefSeq data
104148
@@ -109,18 +153,21 @@ To update the RefSeq data, you will have to install `seqrepo` locally and run `.
109153
is a step-by-step guide on how to do this:
110154
111155
```shell
112-
> mkdir seqrepo
113-
> cd seqrepo
114-
> python3 -m venv .venv
115-
> . .venv/bin/activate
116-
> pip install setuptools
117-
> pip install biocommons.seqrepo
118-
> seqrepo -r . pull --update-latest
119-
> # If you'll get a "Permission denied" error, then you can run the following command (using the temp directory which got created):
120-
> # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
121-
>
122-
> # cd to genomics-operations repo
123-
> python ./utilities/pack_seqrepo_data.py --seqrepo_dir /path/to/seqrepo/dir/latest
124-
> # Upload tar archives from ./tmp/ to a new GitHub release and then update `UTILITIES_DATA_VERSION` in the `.env` file
125-
> # such that it contains the short SHA of the new release which contains the updated data.
156+
$ mkdir seqrepo
157+
$ cd seqrepo
158+
$ python3 -m venv .venv
159+
$ . .venv/bin/activate
160+
$ pip install setuptools==75.7.0
161+
$ pip install biocommons.seqrepo==0.6.9
162+
$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug that's causing issues with the builtin
163+
$ # rsync on OSX.
164+
$ brew install rsync # macOS-specific; on Linux, rsync is usually available from the distribution's package manager
165+
$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
166+
$ # If you'll get a "Permission denied" error, then you can run the following command (using the temp directory which got created):
167+
$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
168+
$
169+
$ # cd to genomics-operations repo
170+
$ python ./utilities/pack_seqrepo_data.py --seqrepo_dir /path/to/seqrepo/dir/latest
171+
$ # Upload tar archives from ./tmp/ to a new GitHub release and then update `UTILITIES_DATA_VERSION` in the `.env` file
172+
$ # such that it contains the short SHA of the new release which contains the updated data.
126173
```
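
The README change above notes that the two `grep -v` filters in the import command are needed because new roles cannot be created on the Heroku Postgres instance. A minimal Python sketch of the same filtering follows. The `filter_dump_lines` helper and the sample dump lines are hypothetical, made up for illustration; only the two patterns are taken from the command in the README.

```python
import re

# The same two patterns that the import command passes to `grep -v`.
DROP_PATTERNS = [
    re.compile(r"^GRANT USAGE ON SCHEMA .* TO anonymous;$"),
    re.compile(r"^ALTER .* OWNER TO uta_admin;$"),
]


def filter_dump_lines(lines):
    """Yield only the lines that match neither pattern (hypothetical helper)."""
    for line in lines:
        text = line.rstrip("\n")
        if any(pattern.search(text) for pattern in DROP_PATTERNS):
            continue
        yield line


# Hypothetical sample lines, not copied from a real UTA dump.
sample = [
    "CREATE SCHEMA uta_20240523b;\n",
    "ALTER SCHEMA uta_20240523b OWNER TO uta_admin;\n",
    "GRANT USAGE ON SCHEMA uta_20240523b TO anonymous;\n",
    "COPY uta_20240523b.gene (hgnc) FROM stdin;\n",
]
kept = list(filter_dump_lines(sample))
print("".join(kept), end="")
```

Because the patterns are anchored at both ends, a data row that merely contains the word `anonymous` is kept, which was not guaranteed by the previous `grep -v anonymous` filter.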

app/input_normalization.py

Lines changed: 1 addition & 3 deletions
@@ -13,7 +13,7 @@
1313
port = os.getenv('PORT', 5000) # The localhost debugger starts the app on port 5000
1414
os.environ['HGVS_SEQREPO_URL'] = f"http://localhost:{port}/utilities/seqfetcher"
1515

16-
database_schema = os.getenv('UTA_DATABASE_SCHEMA', 'uta_20210129b')
16+
database_schema = os.getenv('UTA_DATABASE_SCHEMA', 'uta_20240523b')
1717
# Use the biocommons UTA database if we don't specify a custom one.
1818
# Also, make sure the URL uses `postgresql` instead of `postgres` as schema
1919
database_url = f"{os.getenv('UTA_DATABASE_URL', 'postgresql://anonymous:anonymous@uta.biocommons.org/uta')}/{database_schema}".replace('postgres://', 'postgresql://')
@@ -67,7 +67,6 @@ def normalize_variant(parsed_variant):
6767

6868
def process_NM_HGVS(NM_HGVS):
6969
parsed_variant = hgvsParser.parse_hgvs_variant(NM_HGVS)
70-
print(f"parsed: {parsed_variant}")
7170

7271
projected_variant_dict = project_variant(parsed_variant)
7372
print(
@@ -85,7 +84,6 @@ def process_NM_HGVS(NM_HGVS):
8584

8685
def process_NC_HGVS(NC_HGVS):
8786
parsed_variant = hgvsParser.parse_hgvs_variant(NC_HGVS)
88-
print(f"parsed: {parsed_variant}")
8987

9088
try:
9189
transcripts = b38hgvsAssemblyMapper.relevant_transcripts(parsed_variant)
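
The connection-string expression in the first hunk of this file can be sketched in isolation as follows. The `build_uta_url` helper and the example Heroku-style URL are hypothetical, introduced only to make the behaviour testable; the defaults and the `replace` call mirror the lines shown in the diff.

```python
DEFAULT_UTA_URL = "postgresql://anonymous:anonymous@uta.biocommons.org/uta"
DEFAULT_UTA_SCHEMA = "uta_20240523b"


def build_uta_url(env):
    """Build '<database URL>/<schema>' from a mapping of environment values.

    Hypothetical helper: it takes a plain dict instead of reading os.environ
    so that it can be exercised without touching the real environment.
    """
    base = env.get("UTA_DATABASE_URL", DEFAULT_UTA_URL)
    schema = env.get("UTA_DATABASE_SCHEMA", DEFAULT_UTA_SCHEMA)
    # Rewrite a `postgres://` prefix to `postgresql://`, as the diff does.
    return f"{base}/{schema}".replace("postgres://", "postgresql://")


# No overrides: falls back to the public biocommons instance.
default_url = build_uta_url({})

# Made-up URL using the `postgres://` prefix.
custom_url = build_uta_url(
    {"UTA_DATABASE_URL": "postgres://user:secret@example.invalid:5432/db"}
)
print(default_url)
print(custom_url)
```

A URL that already starts with `postgresql://` passes through unchanged, since the string `postgres://` does not occur inside it.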

utilities/pack_seqrepo_data.py

Lines changed: 3 additions & 3 deletions
@@ -16,9 +16,10 @@
1616

1717
parser.add_argument('--uta_database_schema',
1818
help='UTA database schema',
19-
default=os.environ.get('UTA_DATABASE_SCHEMA'))
19+
default=os.getenv('UTA_DATABASE_SCHEMA', 'uta_20240523b'))
2020
parser.add_argument('--uta_database_url', help='UTA database URL',
21-
default='postgresql://anonymous:anonymous@uta.biocommons.org/uta')
21+
# Use the biocommons UTA database if we don't specify a custom one.
22+
default=os.getenv('UTA_DATABASE_URL', 'postgresql://anonymous:anonymous@uta.biocommons.org/uta'))
2223
parser.add_argument('--seqrepo_dir',
2324
help='Seqrepo directory',
2425
default='/usr/local/share/seqrepo/latest')
@@ -27,7 +28,6 @@
2728
default='tmp')
2829
args = parser.parse_args()
2930

30-
# Use the biocommons UTA database if we don't specify a custom one.
3131
# Also, make sure the URL uses `postgresql` instead of `postgres` as schema
3232
database_url = f"{args.uta_database_url}/{args.uta_database_schema}".replace('postgres://', 'postgresql://')
3333
