
feature: analytics suite #1626

Merged: 48 commits merged on Jul 1, 2024
Changes from 1 commit
48 commits
470956a
feature: logging middleware
densumesh Jun 18, 2024
c159f94
feature: log events to clickhouse
densumesh Jun 21, 2024
46629b1
feature: send clickhouse data in handler
densumesh Jun 21, 2024
f9317c9
feature: query vector in clickhouse
densumesh Jun 21, 2024
cff1dec
bugfix: fix async inserts, feature: add fulltext vectors
densumesh Jun 22, 2024
7a40d54
feature: embed and store clusters in clickhouse
densumesh Jun 22, 2024
94b9806
cleanup: don't send vectors to clickhouse
densumesh Jun 22, 2024
610f56e
cleanup: clippy fixes
densumesh Jun 22, 2024
c2d9ddf
revert: add back self-hosting.md
densumesh Jun 22, 2024
4f51885
feature: access analytics from server routes
densumesh Jun 25, 2024
811e1ae
fmt: cargo fmt
densumesh Jun 26, 2024
9ade962
bugfix: fix redoc errors
densumesh Jun 26, 2024
c7ad13a
feature: clickhouse operator and clustering cronjob
densumesh Jun 26, 2024
84a76e2
bugfix: clickhouse operator
densumesh Jun 26, 2024
55b7f86
feature: log events to clickhouse
densumesh Jun 21, 2024
81367c0
feat: analytics site
drew-harris Jun 25, 2024
04e58d0
feature: env var for analytics
densumesh Jun 27, 2024
d512d40
feature: fix tokio error
densumesh Jun 27, 2024
3639ab9
feature: error silently if clickhouse insert errors
densumesh Jun 27, 2024
5eb8dfc
feature: move to CH for events
densumesh Jun 27, 2024
3bb4889
bugfix: fix clickhouse error
densumesh Jun 27, 2024
7667d19
ops: docker images and CI actions
cdxker Jun 27, 2024
98486b2
feature: events in clickhouse
densumesh Jun 27, 2024
87de69d
fix: add solid query devtools
drew-harris Jun 27, 2024
68788e5
fix: better canvas
drew-harris Jun 27, 2024
be55a1a
feat: showing queries in sidebar
drew-harris Jun 27, 2024
5c1a933
fix: eslint
drew-harris Jun 27, 2024
0dcb310
bugfix: change to new database name
densumesh Jun 27, 2024
1565805
feat: show head queries
drew-harris Jun 27, 2024
97c4608
feat: pagination hook
drew-harris Jun 27, 2024
1abaa7d
feat: low confidence queries
drew-harris Jun 27, 2024
641c8fe
cleanup: change search type to top score
drew-harris Jun 27, 2024
761fc54
feat: remove chunk action and rename bulk chunk upload fail
drew-harris Jun 28, 2024
e51d9c4
feat: log delete chunks success / failure in clickhouse
drew-harris Jun 28, 2024
a34307c
fix: change database name in create_event_query from default -> trieve
drew-harris Jun 28, 2024
8658bc9
cleanup: fix database name and change to chunk count
drew-harris Jun 28, 2024
5f36d6e
bugfix: fix group update event
densumesh Jun 28, 2024
871a9ec
feature: delete from clickhouse when dataset is deleted
densumesh Jun 28, 2024
db88c34
bugfix: replace string interp with binds for clickhouse delete
densumesh Jun 28, 2024
a46f7d7
make envs be evaluated at runtime
cdxker Jul 1, 2024
624d2ab
ops: updated Dockerfiles to compile clickhouse crate
cdxker Jul 1, 2024
9513aa9
bugfix: make dashboard events work again
densumesh Jul 1, 2024
cd4081a
bugfix: change event_data to be a string
densumesh Jul 1, 2024
ab2f2b7
bugfix: make events look better on the UI
densumesh Jul 1, 2024
cf71396
security: remove the ability to auto link user accounts
densumesh Jul 1, 2024
9a4b858
feature: cleanup Dockerfiles
cdxker Jul 1, 2024
5acc2e5
ops: group worker docker image + CI action
cdxker Jul 1, 2024
4399e34
bugfix: make analytics compile
densumesh Jul 1, 2024
feature: access analytics from server routes
densumesh authored and cdxker committed Jul 1, 2024
commit 4f51885671393830bad36c538b1d29032121c765
1 change: 1 addition & 0 deletions .gitignore
@@ -17,6 +17,7 @@ story_html.zip
testing.ipynb
output.json
temp.json
analytics/analytics-server/target
server/target
server/images
server/tantivy
202 changes: 202 additions & 0 deletions analytics/clustering-script/get_clusters.py
densumesh marked this conversation as resolved.
@@ -0,0 +1,202 @@
from datetime import date
import datetime
import enum
import uuid
import anthropic
import clickhouse_connect
import clickhouse_connect.driver
import clickhouse_connect.driver.client
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cosine
import dotenv

dotenv.load_dotenv()
anthropic_client = anthropic.Anthropic()


# Function to fetch data from ClickHouse
def fetch_dataset_vectors(
    client: clickhouse_connect.driver.client.Client, dataset_id: uuid.UUID, limit=5000
):
    query = """
        SELECT id, query, top_score, query_vector
        FROM trieve.search_queries
        WHERE dataset_id = '{}'
        AND created_at >= now() - INTERVAL 7 DAY
        ORDER BY rand()
        LIMIT {}
        """.format(
        str(dataset_id),
        limit,
    )

    vector_result = client.query(query)
    rows = vector_result.result_rows

    return rows


def get_datasets(client: clickhouse_connect.driver.client.Client):
    query = """
        SELECT DISTINCT dataset_id
        FROM search_queries
        """

    dataset_result = client.query(query)
    rows = dataset_result.result_rows
    return rows


def kmeans_clustering(data, n_clusters=10):
    vectors = np.array([row[3] for row in data])
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++")
    kmeans.fit(vectors)
    return kmeans, vectors


# Function to find the closest queries to the centroids
def get_topics(kmeans, vectors, data, n_points=5):
    centroids = kmeans.cluster_centers_
    topics = []

    for i, centroid in enumerate(centroids):
        distances = [cosine(centroid, vector) for vector in vectors]
        closest_indices = np.argsort(distances)[
            : n_points + 1
        ]  # include the centroid itself

        for row in data:
            if row[4] == i:
                row.append(cosine(centroid, row[3]))

        print([data[idx][1] for idx in closest_indices])

        # Create a request to the Claude model
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=50,
            system="You are a data scientist. You have been tasked with clustering search queries into topics. You have just finished clustering a set of queries into a group. You have been asked to generate a 3-5 word topic name for this cluster. ONLY RETURN THE TOPIC AND NO OTHER CONTEXT OR WORDS",
            messages=[
                {
                    "role": "user",
                    "content": f"Here are some search queries from a cluster: {', '.join([data[idx][1] for idx in closest_indices])}",
                },
            ],
        )
        # Get the response text
        reply = response.content[0].text
        # Extract the topic name
        topics.append(reply)

    return data, topics


def append_cluster_membership(data, kmeans):
    labels = kmeans.labels_
    for i, row in enumerate(data):
        row = list(row)
        row.append(labels[i])
        data[i] = row
    return data


def insert_centroids(
    client: clickhouse_connect.driver.client.Client, data, dataset_id, topics
):
    print(data[0][5])
    cluster_ids_to_delete_query = """
        SELECT id
        FROM trieve.cluster_topics
        WHERE dataset_id = '{}'
        """.format(
        str(dataset_id[0])
    )
    cluster_ids_to_delete = [
        str(row[0]) for row in client.query(cluster_ids_to_delete_query).result_rows
    ]
    print(cluster_ids_to_delete)

    delete_previous_query = """
        DELETE FROM trieve.cluster_topics
        WHERE dataset_id = '{}'
        """.format(
        str(dataset_id[0])
    )
    client.query(delete_previous_query)
    if len(cluster_ids_to_delete) > 0:
        delete_previous_search_cluster_memberships_query = """
            DELETE FROM trieve.search_cluster_memberships
            WHERE cluster_id IN ('{}')
            """.format(
            "', '".join(cluster_ids_to_delete)
        )
        client.query(delete_previous_search_cluster_memberships_query)

    topic_ids = [uuid.uuid4() for _ in range(len(topics))]

    client.insert(
        "cluster_topics",
        [
            [
                topic_ids[i],
                dataset_id[0],
                topic,
                len([row for row in data if len(row) == 6 and row[4] == i]),
                np.mean([row[2] for row in data if len(row) == 6 and row[4] == i]),
                datetime.datetime.now(),
            ]
            for i, topic in enumerate(topics)
        ],
        column_names=[
            "id",
            "dataset_id",
            "topic",
            "density",
            "avg_score",
            "created_at",
        ],
        settings={
            "async_insert": "1",
            "wait_for_async_insert": "0",
        },
    )

    client.insert(
        "search_cluster_memberships",
        [[uuid.uuid4(), row[0], topic_ids[row[4]], float(row[5])] for row in data],
        settings={
            "async_insert": "1",
            "wait_for_async_insert": "0",
        },
    )


# Main script
if __name__ == "__main__":
    # Connect to ClickHouse
    client = clickhouse_connect.get_client(
        host="localhost",
        port=8123,
        username="clickhouse",
        password="password",
        database="trieve",
    )

    dataset_ids = get_datasets(client)
    for dataset_id in dataset_ids:
        # Fetch data
        data = fetch_dataset_vectors(client, dataset_id[0], 3000)

        # Perform spherical k-means clustering
        n_clusters = 15  # Change this to the desired number of clusters
        kmeans, vectors = kmeans_clustering(data, n_clusters)

        # Append cluster membership to the data
        data = append_cluster_membership(data, kmeans)

        # Find the closest queries to the centroids
        data, topics = get_topics(kmeans, vectors, data)

        # Insert the topics into the database
        insert_centroids(client, data, dataset_id, topics)
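The clustering step in this script can be exercised without ClickHouse or Anthropic. Below is a minimal, self-contained sketch of the `kmeans_clustering` plus centroid-nearest-query lookup from `get_topics`, using synthetic vectors in place of stored query embeddings (the blob data, query names, and `nearest_queries` helper are illustrative, not part of the script):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cosine


def nearest_queries(vectors, queries, n_clusters=2, n_points=2):
    # Fit k-means, then for each centroid return the closest queries
    # by cosine distance, mirroring the script's get_topics logic.
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(vectors)
    groups = []
    for centroid in kmeans.cluster_centers_:
        distances = [cosine(centroid, v) for v in vectors]
        closest = np.argsort(distances)[:n_points]
        groups.append([queries[i] for i in closest])
    return groups


rng = np.random.default_rng(0)
# Two well-separated blobs standing in for query embeddings.
vectors = np.vstack(
    [
        rng.normal(0, 0.05, (5, 4)) + [1, 0, 0, 0],
        rng.normal(0, 0.05, (5, 4)) + [0, 1, 0, 0],
    ]
)
queries = [f"query-{i}" for i in range(10)]
groups = nearest_queries(vectors, queries)
print(len(groups))
```

In the real script, each group of nearest queries is then sent to Claude for a short topic label; here the groups themselves are the output.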
125 changes: 0 additions & 125 deletions analytics/get_clusters.py

This file was deleted.
