Musoq is handy tool that allows to use SQL syntax on a variety of data sources.
Musoq exposes raw data sets as queryable sources. This allows to search these data sources using SQL syntax variant. What can be used as query source? Virtually anything! Here are some ideas (many of them are already included in the project!):
- Directories
- Files
- Structured files (.csv, .json, .xml, logs)
- Photos (by exif attributes)
- Archived files (.zip)
- Git, Svn, TFS
- Websites (tables, lists)
- Processes
- Time
It is possible to mix sources between each other.
select * from #os.files('path/to/folder', false) where Extension = '.exe' or Extension = '.png'
or through reordered syntax:
from #os.files('path/to/folder', false) where Extension = '.exe' or Extension = '.png' select *
To run it, visit Musoq installation page. You will find there latest release with installation process description.
Yes, it does work on linux. I have tested it on Ubuntu 18.04. If you try to run it on different distro or version. I will be grateful if you would post an issue reporting either success or fail.
- Optional query reordering (from ... where ... group by ... having ... select ... skip N take N2)
- Use of
*
to select all columns. - Group by operator.
- Having operator.
- Skip & Take operators.
- Complex object accessing ability
column.Name
. - User defined functions and aggregation functions.
- Plugin API (to create your own custom data source).
- Set operators (non sql-like usage) (union, union all, except, intersect).
- Parametrizable sources.
- Like / not Like operator.
- RLike / not RLike operator (regex like operator).
- Contains operator (Doesn't support nested queries yet).
- CTE expressions.
- Desc for schema, schema table constructors and tables.
- In syntax.
- Inner, Left outer, Right outer join syntax.
SELECT
FullName
FROM #os.files('C:/Some/Path/To/Dir', true)
WHERE Extension = '.png' OR Extension = '.jpg'
SELECT
FullName
FROM #os.files('C:/Some/Path/To/Dir', true)
WHERE Extension IN ('.png', '.jpg')
SELECT
DirectoryName,
Sum(Length) / 1024 / 1024 as 'MB',
Min(Length) as 'Min',
Max(Length) as 'Max',
Count(FullName) as 'CountOfFiles',
FROM #os.files('', true)
GROUP BY DirectoryName
SELECT
*
FROM #os.files('', true)
WHERE Name like '%report%'
SELECT
FullName
FROM #os.files('E:/', true)
WHERE
IsAudio() AND
HasWordThatSoundLike(Name, 'material')
SELECT
ToHex(Head(5), '|'),
ToHex(Tail(5), '|'),
ToHex(GetFileBytes(10, 5), '|')
FROM #os.files('', false)
WITH filesOfA AS (
SELECT
GetRelativeName('E:\DiffDirsTests\A') AS FullName,
Sha256File() AS ShaedFile
FROM #os.files('E:\DiffDirsTests\A', true)
), filesOfB AS (
SELECT
GetRelativeName('E:\DiffDirsTests\B') AS FullName,
Sha256File() AS ShaedFile
FROM #os.files('E:\DiffDirsTests\B', true)
), inBothDirs AS (
SELECT
a.FullName AS FullName,
(
CASE WHEN a.ShaedFile = b.ShaedFile
THEN 'The Same'
ELSE 'Modified'
END
) AS Status
FROM filesOfA a INNER JOIN filesOfB b ON a.FullName = b.FullName
), inSourceDir AS (
SELECT
a.FullName AS FullName,
'Removed' AS Status
FROM filesOfA a LEFT OUTER JOIN filesOfB b ON a.FullName = b.FullName
), inDestinationDir AS (
SELECT
b.FullName AS FullName,
'Added' AS Status
FROM filesOfA a RIGHT OUTER JOIN filesOfB b ON a.FullName = b.FullName
)
SELECT
inBoth.FullName AS FullName,
inBoth.Status AS Status
FROM inBothDirs inBoth
UNION (FullName)
SELECT
inSource.FullName AS FullName,
inSource.Status AS Status
FROM inSourceDir inSource
UNION (FullName)
SELECT
inDest.FullName AS FullName,
inDest.Status AS Status
FROM inDestinationDir inDest
SELECT
(
CASE WHEN SourceFile IS NOT NULL
THEN SourceFileRelative
ELSE DestinationFileRelative
END
) AS FullName,
(
CASE WHEN State = 'TheSame'
THEN 'The Same'
ELSE State
END
) AS Status
FROM #os.dirscompare('E:\DiffDirsTests\A', 'E:\DiffDirsTests\B')
SELECT
DirectoryName,
AggregateValues(Name)
FROM #os.files('E:/', true)
WHERE IsZipArchive()
GROUP BY DirectoryName
SELECT
FullName
FROM #os.files('', true)
WHERE ToDecimal(Length) / 1024 / 1024 / 1024 > 1
SELECT
ocr.GetText(file.FullName) as text
FROM
#os.files('E:/Path/To/Directory', false) file
INNER JOIN
#ocr.single() ocr
ON 1 = 1
WHERE files.Extension = '.png'
SELECT Value FROM #system.range(1, 10)
There is a built-in way to list all the columns from a source, all plugins supports it out of the box! The command is: desc #schema.table(someArg1, someArg2)
.
You can easily plug-in your own data source. There is fairly simple plugin api that all sources use. To read in details how to do it, jump into wiki section of this repo click.
Tested on laptop with i7 7700HQ, 12 GB RAM, Windows 10, Main Disk (250 GB SSD), Secondary Disk (1TB HDD). Files were placed on the HDD. The query tested was counting how many rows the files has. The file tested was a single 6GB csv file with 11 columns. For each test the file was split to reflect sizes you can observe in chart. This should give you some guidance on what data processing rate you can expect using this tool.
On the one hand, I needed something that allowed me to perform queries on my own bank account file, at the same time something that filters with respect to file names and their content. I had the idea that I would like it to be a single tool rather than a set of tools. That's how the musoq was born in my mind, with extensible plugins system and user defined grouping operators. All that Musoq does, you can achieve by "hand writing" multiple scripts manually, however I found it useful to automate this process and as a result minimizing the amount of time to create it. Fast querying was my goal. Looking at it another way, you might see that Musoq transpiles SQL code into C# code and then compiles it with Roslyn. In that case, writing C# code is redundant when all you have to do is to write a query and it will do the magic with your data source.
As the language looks like sql, it doesn't mean it is fully SQL compliant. It uses SQL syntax and repeats some of it's behaviour however, some differences may appear. It will also implement some experimental syntax and behaviours that are not used by database engines.
This project is licensed under the MIT License - see the LICENSE file for details